The stock market has consistently proven to be a good avenue for investing and saving for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, build wealth, and even provide some tax benefits. Steady returns compounded over a long period can grow far beyond what seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to assess a stock's worth, and repeating that analysis across a multitude of stocks to identify the right picks for an individual can be tedious. Cluster analysis can identify stocks that exhibit similar characteristics as well as those with minimal correlation to one another. This helps investors analyze stocks across different market segments and protects against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for companies listed on the New York Stock Exchange. Your tasks are to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 yellowbrick==1.5 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.2 yellowbrick==1.5 -q --user
# !pip install --upgrade -q jinja2
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
1. Preview the first few rows of the dataset
# Preview the first 5 rows of the dataset
display(df.head())
| | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
2. Dataset Information (info)
# Get dataset information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Ticker Symbol                 340 non-null    object
 1   Security                      340 non-null    object
 2   GICS Sector                   340 non-null    object
 3   GICS Sub Industry             340 non-null    object
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64
 8   Cash Ratio                    340 non-null    int64
 9   Net Cash Flow                 340 non-null    int64
 10  Net Income                    340 non-null    int64
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
3. Summary Statistics (describe)
# Get summary statistics for numerical columns
display(df.describe())
| | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 340.000000 | 3.400000e+02 | 3.400000e+02 | 340.000000 | 3.400000e+02 | 340.000000 | 340.000000 |
| mean | 80.862345 | 4.078194 | 1.525976 | 39.597059 | 70.023529 | 5.553762e+07 | 1.494385e+09 | 2.776662 | 5.770283e+08 | 32.612563 | -1.718249 |
| std | 98.055086 | 12.006338 | 0.591798 | 96.547538 | 90.421331 | 1.946365e+09 | 3.940150e+09 | 6.587779 | 8.458496e+08 | 44.348731 | 13.966912 |
| min | 4.500000 | -47.129693 | 0.733163 | 1.000000 | 0.000000 | -1.120800e+10 | -2.352800e+10 | -61.200000 | 2.767216e+07 | 2.935451 | -76.119077 |
| 25% | 38.555000 | -0.939484 | 1.134878 | 9.750000 | 18.000000 | -1.939065e+08 | 3.523012e+08 | 1.557500 | 1.588482e+08 | 15.044653 | -4.352056 |
| 50% | 59.705000 | 4.819505 | 1.385593 | 15.000000 | 47.000000 | 2.098000e+06 | 7.073360e+08 | 2.895000 | 3.096751e+08 | 20.819876 | -1.067170 |
| 75% | 92.880001 | 10.695493 | 1.695549 | 27.000000 | 99.000000 | 1.698108e+08 | 1.899000e+09 | 4.620000 | 5.731175e+08 | 31.764755 | 3.917066 |
| max | 1274.949951 | 55.051683 | 4.580042 | 917.000000 | 958.000000 | 2.076400e+10 | 2.444200e+10 | 50.090000 | 6.159292e+09 | 528.039074 | 129.064585 |
4. Duplicate Check
# Check for duplicate rows
duplicate_check = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_check}")
Number of duplicate rows: 0
5. Missing Value Check
# Check for missing values
missing_values = df.isnull().sum()
# Print missing values count
if missing_values.sum() == 0:
print("No missing values in the dataset.")
else:
print("Missing values in each column:\n", missing_values)
No missing values in the dataset.
6. Data Type Check
# Check data types of each column
data_types = df.dtypes
print("Data types of each column:\n", data_types)
Data types of each column:
Ticker Symbol                    object
Security                         object
GICS Sector                      object
GICS Sub Industry                object
Current Price                   float64
Price Change                    float64
Volatility                      float64
ROE                               int64
Cash Ratio                        int64
Net Cash Flow                     int64
Net Income                        int64
Earnings Per Share              float64
Estimated Shares Outstanding    float64
P/E Ratio                       float64
P/B Ratio                       float64
dtype: object
7. Unique Value Check (for Categorical Columns)
# Check unique values in categorical columns and return the count of unique values
for col in df.select_dtypes(include='object').columns:
unique_value_count = df[col].nunique()
print(f"Column: {col} | Unique Values Count: {unique_value_count}")
Column: Ticker Symbol | Unique Values Count: 340
Column: Security | Unique Values Count: 340
Column: GICS Sector | Unique Values Count: 11
Column: GICS Sub Industry | Unique Values Count: 104
8. Outlier Check
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Visualize outliers using box plots for numerical columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
plt.figure(figsize=(15,10))
for i, col in enumerate(numeric_columns, 1):
plt.subplot(4, 3, i)
sns.boxplot(x=df[col])
plt.title(f'Box plot of {col}')
plt.tight_layout()
plt.show()
Observation
The box plots show the presence of outliers in several columns, but since outliers can sometimes represent significant data points, we'll proceed cautiously.
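Beyond the visual check, the number of outliers per column can be quantified with the standard 1.5x IQR rule; the sketch below (using an illustrative toy frame in place of the actual dataset) shows one way to do it:

```python
import pandas as pd

def iqr_outlier_counts(data: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    num = data.select_dtypes(include="number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column limits against each column
    return (num.lt(q1 - 1.5 * iqr) | num.gt(q3 + 1.5 * iqr)).sum()

# Toy frame: column "a" has one clear outlier, column "b" has none
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
print(iqr_outlier_counts(toy))  # a -> 1, b -> 0
```

Applied to the stock data (`iqr_outlier_counts(df)`), this would give a per-column tally to accompany the box plots.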
EDA Univariate analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Automatically fetch the numeric columns
numerical_features = df.select_dtypes(include='number').columns
# Create a 5x3 grid for the subplots
fig, axes = plt.subplots(5, 3, figsize=(18, 20))
# Flatten the axes array for easier indexing
axes = axes.flatten()
# Plot distribution for each numerical feature
for i, feature in enumerate(numerical_features):
sns.histplot(df[feature], kde=True, bins=30, color='skyblue', ax=axes[i])
# Plot the mean line
mean_value = df[feature].mean()
axes[i].axvline(mean_value, color='green', linestyle='dashed', linewidth=2)
# Add title and labels
axes[i].set_title(f'Distribution of {feature}', fontsize=14)
axes[i].set_xlabel(feature, fontsize=12)
axes[i].set_ylabel('Frequency', fontsize=12)
# Display the mean value on the plot
axes[i].text(mean_value, axes[i].get_ylim()[1]*0.9, f'Mean: {mean_value:.2f}',
color='green', fontsize=12)
# Remove any empty subplots (if there are fewer than 15 features)
for i in range(len(numerical_features), 15):
fig.delaxes(axes[i])
# Adjust the layout to prevent overlap
plt.tight_layout()
plt.show()
Observation: From the distributions above, we can make a few observations: most numerical features are strongly right-skewed with long tails (e.g., Current Price, Net Income, Estimated Shares Outstanding, P/E Ratio), Volatility is moderately right-skewed, and Price Change is roughly symmetric around its mean.
2. Box Plots for Numerical Variables
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Automatically fetch the numeric columns
numerical_features = df.select_dtypes(include='number').columns
# Create a 5x3 grid for the subplots (total of 15 subplots)
fig, axes = plt.subplots(5, 3, figsize=(18, 20))
# Flatten the axes array for easier indexing
axes = axes.flatten()
# Plot box plots for each numerical feature
for i, feature in enumerate(numerical_features):
sns.boxplot(data=df, y=feature, ax=axes[i])
# Calculate the mean of the feature
mean_val = df[feature].mean()
# Add a mean line to the boxplot
axes[i].axhline(mean_val, color='red', linestyle='--', linewidth=2)
# Annotate the mean value
axes[i].annotate(f'Mean: {mean_val:.2f}', xy=(0.5, mean_val), xycoords='data',
xytext=(0, 10), textcoords='offset points', color='red',
ha='center', fontsize=10)
axes[i].set_title(f'Box Plot of {feature}', fontsize=14)
axes[i].set_xlabel('')
axes[i].set_ylabel(feature, fontsize=12)
# Remove any empty subplots (if the number of numerical features is less than 15)
for i in range(len(numerical_features), 15):
fig.delaxes(axes[i])
# Adjust the layout to prevent overlap
plt.tight_layout()
plt.show()
Observation: From the box plots of the numerical variables, we can observe the presence of outliers in most columns.
3. Distribution of Categorical Variables
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# List of categorical features
categorical_features = ['GICS Sector']
# Plot count plots for each categorical feature
for feature in categorical_features:
plt.figure(figsize=(12, 8))
# Create the countplot
ax = sns.countplot(y=df[feature], palette='Set2')
# Add counts on top of the bars
for p in ax.patches:
ax.annotate(f'{p.get_width():.0f}', # Display count (width of bar)
(p.get_width(), p.get_y() + p.get_height() / 2), # Position the text
ha='left', va='center', fontsize=12, color='black', xytext=(5, 0),
textcoords='offset points') # Adjust the text offset slightly
# Set titles and labels
plt.title(f'Distribution of {feature}', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel(feature, fontsize=12)
# Ensure a clean layout
plt.tight_layout()
plt.show()
Observation
From the distribution of GICS Sector, the sector with the most companies is Industrials, with 53 stocks, followed by Financials with 49 and Health Care with 40. This highlights the Industrials sector as having the largest representation in the dataset.
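The counts behind the plot can also be read off directly with pandas' `value_counts()`; a minimal sketch on toy sector labels standing in for `df['GICS Sector']`:

```python
import pandas as pd

# Toy sector labels standing in for df['GICS Sector']
sectors = pd.Series(["Industrials", "Financials", "Industrials",
                     "Health Care", "Industrials", "Financials"])

# value_counts() returns counts sorted in descending order
counts = sectors.value_counts()
print(counts)  # Industrials (3) leads in this toy sample
```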
EDA Bivariate analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# List of pairs of numerical features for scatter plots
numerical_pairs = [
('Current Price', 'Earnings Per Share'),
('Price Change', 'Volatility'),
('Net Income', 'Earnings Per Share'),
('P/E Ratio', 'P/B Ratio')
]
# Create a 2x2 subplot grid (2 rows, 2 columns)
fig, axes = plt.subplots(2, 2, figsize=(14, 10)) # Adjust figsize to fit 2x2 grid
fig.tight_layout(pad=5.0) # Adjust spacing between plots
# Flatten the axes array for easier indexing
axes = axes.flatten()
# Plot scatter plots for each pair
for i, (x, y) in enumerate(numerical_pairs):
sns.scatterplot(data=df, x=x, y=y, color='coral', alpha=0.7, ax=axes[i])
axes[i].set_title(f'Scatter Plot of {x} vs. {y}', fontsize=14)
axes[i].set_xlabel(x, fontsize=12)
axes[i].set_ylabel(y, fontsize=12)
# Show the entire figure
plt.show()
Observation
Based on the scatter plots of the numerical variable pairs, we can see several distinct groupings, some containing very few data points.
2. Correlation Heatmap
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Compute the correlation matrix
corr_matrix = df[['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio']].corr()
# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Numerical Features', fontsize=16)
plt.tight_layout()
plt.show()
3. Box Plots by Category
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# List of numerical features to compare by GICS Sector
numerical_features = ['Current Price', 'Price Change', 'Volatility', 'Earnings Per Share', 'P/E Ratio']
# Create a 3x2 subplot grid (3 rows, 2 columns)
fig, axes = plt.subplots(3, 2, figsize=(18, 12)) # Adjust the figure size for a 3x2 grid
fig.tight_layout(pad=5.0) # Adjust spacing between plots
# Flatten the axes array for easier indexing
axes = axes.flatten()
# Plot box plots for each numerical feature by GICS Sector
for i, feature in enumerate(numerical_features):
sns.boxplot(data=df, x='GICS Sector', y=feature, palette='Set2', ax=axes[i])
axes[i].set_title(f'Box Plot of {feature} by GICS Sector', fontsize=14)
axes[i].set_xlabel('GICS Sector', fontsize=12)
axes[i].set_ylabel(feature, fontsize=12)
axes[i].tick_params(axis='x', rotation=45) # Rotate x-axis labels for better readability
# If there are fewer plots than subplots, turn off the remaining axes
if len(numerical_features) < len(axes):
for j in range(len(numerical_features), len(axes)):
fig.delaxes(axes[j])
# Show the entire figure
plt.show()
4. Pair Plots for Selected Numerical Features
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# List of selected numerical features for pair plot
subset_numerical_features = ['Current Price', 'Price Change', 'Volatility',
'Earnings Per Share', 'P/E Ratio']
# Plot pair plots for selected numerical features
# Note: sns.pairplot creates its own figure, so a prior plt.figure() call
# would only produce an empty figure
sns.pairplot(df[subset_numerical_features], diag_kind='kde', plot_kws={'alpha':0.5})
plt.suptitle('Pair Plot of Selected Numerical Features', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
Observation
From the Pair Plots for selected numerical features, I observed that the data points for Current Price, Price Change, Volatility, Earnings Per Share, and P/E Ratio suggest the presence of three to four clusters. This indicates potential groupings within the data, where similar patterns or relationships emerge across these financial metrics, suggesting distinct market behaviors or stock characteristics across the clusters.
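One practical note for the eventual clustering: these features sit on vastly different scales (Net Income is in the billions while Volatility is near 1), so any distance-based grouping would first require standardization. A minimal `StandardScaler` sketch on toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on wildly different scales, mimicking Net Income vs Volatility
X = np.array([[1.0e9, 1.2],
              [2.0e9, 1.5],
              [3.0e9, 0.9]])

# Standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```

After scaling, both columns contribute comparably to Euclidean distances.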
Questions:
Question 1: What does the distribution of stock prices look like?
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Create a figure and axis
plt.figure(figsize=(10,6))
# Plot the distribution of 'Current Price' using seaborn's histplot
ax = sns.histplot(df['Current Price'], kde=True, bins=30, color='blue')
# Add counts on top of the bars
for p in ax.patches:
height = p.get_height()
if height > 0:
ax.text(p.get_x() + p.get_width() / 2, height + 1, int(height),
ha="center", fontsize=10)
# Calculate and add mean line
mean_price = df['Current Price'].mean()
plt.axvline(mean_price, color='red', linestyle='--', label=f'Mean: {mean_price:.2f}')
# Add title and labels
plt.title('Distribution of Stock Prices', fontsize=16)
plt.xlabel('Stock Price (USD)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# Format the plot for better readability
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Add legend for the mean line
plt.legend()
# Display the plot
plt.tight_layout()
plt.show()
Observation on Distribution of Stock Prices
Distribution Characteristics:
The distribution of stock prices exhibits the following characteristics:
Right Skewness: The distribution shows a pronounced right (positive) skew, indicating that the majority of stock prices are concentrated towards the lower end of the range, with a long, thin tail of fewer, significantly higher-priced stocks.
Overall Distribution: The histogram, along with the kernel density estimate (KDE), shows that most stock prices are below USD 200, with frequency decreasing as price increases. The KDE line smooths the distribution and confirms the right skew, giving a clearer view of the density of stock prices across the range.
Conclusion: The distribution analysis reveals that most stocks are priced at the lower end of the range: the median price is about USD 60, and three quarters of stocks trade below roughly USD 93. Higher-priced stocks are less common but still present, with a few extreme values extending to about USD 1,275. This distribution insight helps in understanding the general pricing landscape of the stocks within this dataset, suggesting a market where lower-priced stocks are far more prevalent.
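The direction of the skew can also be verified numerically: pandas' `skew()` returns a positive value for a long upper tail. A sketch on synthetic prices (not the actual dataset):

```python
import pandas as pd

# Synthetic right-skewed "prices": many low values, a few very large ones
prices = pd.Series([20, 30, 35, 40, 45, 50, 60, 80, 300, 1200])

# Positive skewness indicates a long right tail
skewness = prices.skew()
print(f"Skewness: {skewness:.2f}")
```

On the real data, `df['Current Price'].skew()` would quantify the skew seen in the histogram.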
Question 2: The stocks of which economic sector have seen the maximum price increase on average?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Calculate the average price change by sector
sector_price_change = df.groupby('GICS Sector')['Price Change'].mean()
# Find the maximum price increase value
max_increase_value = sector_price_change.max()
# Set the aesthetic style of the plots
sns.set(style="whitegrid")
# Create a figure and axis
plt.figure(figsize=(12, 8))
# Plot a bar chart showing the average price change by sector
ax = sector_price_change.sort_values().plot(kind='barh', color='skyblue', edgecolor='black')
# Add titles and labels
plt.title('Average Price Change by Economic Sector', fontsize=16)
plt.xlabel('Average Price Change (%)', fontsize=12)
plt.ylabel('GICS Sector', fontsize=12)
# Add percentage labels at the top of each bar
for p in ax.patches:
width = p.get_width()
ax.annotate(f'{width:.2f}%', (width, p.get_y() + p.get_height() / 2),
ha='left', va='center', xytext=(3, 0), textcoords='offset points')
# Highlight the sector with the maximum increase
plt.axvline(x=max_increase_value, color='red', linestyle='--', label=f'Max Increase: {max_increase_value:.2f}%')
# Add a legend
plt.legend()
# Display the plot
plt.tight_layout()
plt.show()
# Create a table with the percentages
print("\nAverage Price Change by Economic Sector:")
print(sector_price_change.reset_index().rename(columns={'Price Change': 'Average Price Change (%)'}))
# Group the dataset by 'GICS Sector' and calculate the average price change for each sector
sector_price_change = df.groupby('GICS Sector')['Price Change'].mean()
# Find the sector with the maximum average price increase
max_increase_sector = sector_price_change.idxmax()
max_increase_value = sector_price_change.max()
# Display the results
print(f"The sector with the maximum average price increase is: {max_increase_sector}")
print(f"The average price increase in this sector is: {max_increase_value:.2f}%")
Average Price Change by Economic Sector:
GICS Sector Average Price Change (%)
0 Consumer Discretionary 5.846093
1 Consumer Staples 8.684750
2 Energy -10.228289
3 Financials 3.865406
4 Health Care 9.585652
5 Industrials 2.833127
6 Information Technology 7.217476
7 Materials 5.589738
8 Real Estate 6.205548
9 Telecommunications Services 6.956980
10 Utilities 0.803657
The sector with the maximum average price increase is: Health Care
The average price increase in this sector is: 9.59%
Observations:
Health Care Sector: Health Care recorded the highest average price increase, at 9.59%.
Comparison with Other Sectors: Consumer Staples (8.68%) and Information Technology (7.22%) follow, while Energy is the only sector with an average decline (-10.23%).
Conclusion: The Health Care sector's leading position in average price increase reflects its strong market performance relative to other sectors. This sector's ability to achieve the highest average price increase highlights its resilience and potential for growth, making it a notable area of interest for investors seeking high returns.
Question 3: How are the different variables correlated with each other?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Select only numeric columns
numeric_df = df.select_dtypes(include=['float64', 'int64'])
# Calculate the correlation matrix
correlation_matrix = numeric_df.corr()
# Set up the matplotlib figure
plt.figure(figsize=(12, 10))
# Draw the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
# Add title
plt.title('Correlation Matrix of Stock Data')
# Show the plot
plt.show()
# Interpret the correlation matrix
def interpret_correlation(correlation_matrix):
# Define thresholds for strong and weak correlations
strong_threshold = 0.7
weak_threshold = 0.3
# Get pairs of variables with strong correlations
strong_corr = []
weak_corr = []
num_vars = correlation_matrix.shape[0]
for i in range(num_vars):
for j in range(i + 1, num_vars):
var1 = correlation_matrix.columns[i]
var2 = correlation_matrix.columns[j]
corr_value = correlation_matrix.iloc[i, j]
if abs(corr_value) >= strong_threshold:
strong_corr.append((var1, var2, corr_value))
elif abs(corr_value) >= weak_threshold:
weak_corr.append((var1, var2, corr_value))
print("Strong Correlations (|r| >= 0.7):")
for var1, var2, value in strong_corr:
print(f" {var1} and {var2}: {value:.2f}")
print("\nWeak Correlations (0.3 <= |r| < 0.7):")
for var1, var2, value in weak_corr:
print(f" {var1} and {var2}: {value:.2f}")
# Call the function to interpret the correlation matrix
interpret_correlation(correlation_matrix)
Strong Correlations (|r| >= 0.7):

Weak Correlations (0.3 <= |r| < 0.7):
  Current Price and Earnings Per Share: 0.48
  Price Change and Volatility: -0.41
  Volatility and Net Income: -0.38
  Volatility and Earnings Per Share: -0.38
  ROE and Earnings Per Share: -0.41
  Net Income and Earnings Per Share: 0.56
  Net Income and Estimated Shares Outstanding: 0.59
Observations on Correlation Analysis
In our dataset, no pair of variables has a correlation coefficient of 0.7 or higher in absolute value, so there are no strong linear relationships. The moderate correlations (0.3 <= |r| < 0.7) observed are:
Current Price and Earnings Per Share (EPS): 0.48
Price Change and Volatility: -0.41
Volatility and Net Income: -0.38
Volatility and Earnings Per Share (EPS): -0.38
Return on Equity (ROE) and EPS: -0.41
Net Income and EPS: 0.56
Net Income and Estimated Shares Outstanding: 0.59
The correlation analysis reveals several moderate relationships between financial metrics. Positive correlations are seen between current price and EPS, net income and EPS, and net income and estimated shares outstanding. In contrast, negative correlations are observed between price change and volatility, volatility and net income, volatility and EPS, and ROE and EPS. These insights provide a nuanced understanding of how different financial indicators relate to each other and can inform decisions related to stock selection and portfolio management.
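As a vectorized alternative to a nested loop over the matrix, the unique pairs can be extracted by masking the upper triangle and ranked by absolute correlation; a sketch on a toy matrix built from a subset of the values reported above:

```python
import numpy as np
import pandas as pd

# Toy correlation matrix using a subset of the values reported above
corr = pd.DataFrame(
    [[1.00, 0.56, -0.38],
     [0.56, 1.00, -0.38],
     [-0.38, -0.38, 1.00]],
    index=["Net Income", "EPS", "Volatility"],
    columns=["Net Income", "EPS", "Volatility"],
)

# Keep only the upper triangle (each pair once, no self-correlations),
# then rank the pairs by absolute correlation
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

On the real data, replacing `corr` with `correlation_matrix` would list all pairs in one line of output each.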
Question 4: Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Group by GICS Sector and calculate the average cash ratio
sector_cash_ratio = df.groupby('GICS Sector')['Cash Ratio'].mean().reset_index()
# Set up the matplotlib figure
plt.figure(figsize=(12, 8))
# Create a bar plot of average cash ratio across economic sectors
ax = sns.barplot(x='Cash Ratio', y='GICS Sector', data=sector_cash_ratio, palette='viridis')
# Add title and labels
plt.title('Average Cash Ratio Across Economic Sectors', fontsize=16)
plt.xlabel('Average Cash Ratio', fontsize=12)
plt.ylabel('GICS Sector', fontsize=12)
# Add counts (cash ratio values) on top of the bars
for index, value in enumerate(sector_cash_ratio['Cash Ratio']):
ax.text(value + 0.01, index, f'{value:.2f}', va='center', fontsize=10, color='black')
# Display the plot
plt.tight_layout()
plt.show()
# Print the average cash ratio for each sector
print("Average Cash Ratio by Economic Sector:")
print(sector_cash_ratio)
Average Cash Ratio by Economic Sector:
GICS Sector Cash Ratio
0 Consumer Discretionary 49.575000
1 Consumer Staples 70.947368
2 Energy 51.133333
3 Financials 98.591837
4 Health Care 103.775000
5 Industrials 36.188679
6 Information Technology 149.818182
7 Materials 41.700000
8 Real Estate 50.111111
9 Telecommunications Services 117.000000
10 Utilities 13.625000
Observations on Average Cash Ratio by Economic Sector
1. Consumer Discretionary: 49.58
2. Consumer Staples: 70.95
3. Energy: 51.13
4. Financials: 98.59
5. Health Care: 103.78
6. Industrials: 36.19
7. Information Technology: 149.82
8. Materials: 41.70
9. Real Estate: 50.11
10. Telecommunications Services: 117.00
11. Utilities: 13.63
Summary
The analysis of the average cash ratio across different economic sectors reveals a diverse range of liquidity positions. The Information Technology and Health Care sectors stand out with the highest cash ratios, reflecting strong liquidity. In contrast, the Utilities and Industrials sectors show the lowest cash ratios, indicating relatively weaker liquidity. These variations highlight the different financial strategies and operational models across sectors, which can impact their ability to manage short-term financial obligations.
Question 5: P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Group by GICS Sector and calculate the average P/E ratio
sector_pe_ratio = df.groupby('GICS Sector')['P/E Ratio'].mean().reset_index()
# Set up the matplotlib figure
plt.figure(figsize=(12, 8))
# Create a bar plot of average P/E ratio across economic sectors
ax = sns.barplot(x='P/E Ratio', y='GICS Sector', data=sector_pe_ratio, palette='viridis')
# Add title and labels
plt.title('Average P/E Ratio Across Economic Sectors', fontsize=16)
plt.xlabel('Average P/E Ratio', fontsize=12)
plt.ylabel('GICS Sector', fontsize=12)
# Add counts (P/E ratio values) on top of the bars
for index, value in enumerate(sector_pe_ratio['P/E Ratio']):
ax.text(value + 0.1, index, f'{value:.2f}', va='center', fontsize=10, color='black')
# Display the plot
plt.tight_layout()
plt.show()
# Print the average P/E ratio for each sector
print("Average P/E Ratio by Economic Sector:")
print(sector_pe_ratio)
Average P/E Ratio by Economic Sector:
GICS Sector P/E Ratio
0 Consumer Discretionary 35.211613
1 Consumer Staples 25.521195
2 Energy 72.897709
3 Financials 16.023151
4 Health Care 41.135272
5 Industrials 18.259380
6 Information Technology 43.782546
7 Materials 24.585352
8 Real Estate 43.065585
9 Telecommunications Services 12.222578
10 Utilities 18.719412
Observations on Average P/E Ratio by Economic Sector
1. Consumer Discretionary: 35.21
2. Consumer Staples: 25.52
3. Energy: 72.90
4. Financials: 16.02
5. Health Care: 41.14
6. Industrials: 18.26
7. Information Technology: 43.78
8. Materials: 24.59
9. Real Estate: 43.07
10. Telecommunications Services: 12.22
11. Utilities: 18.72
The analysis reveals significant variation in average P/E ratios across sectors. The Energy sector's very high average P/E (72.90) is likely driven by depressed earnings rather than growth optimism, given that Energy also saw the largest average price decline. Information Technology and Real Estate show elevated average P/E ratios, reflecting stronger investor expectations and growth potential. In contrast, sectors such as Telecommunications Services, Financials, and Industrials have lower average P/E ratios, indicating more conservative valuations. These differences highlight the diverse investment appetites and growth prospects associated with each sector.
Duplicate value check
# Check for duplicate rows
duplicate_check = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_check}")
Number of duplicate rows: 0
Missing value treatment
# Check for missing values
missing_values = df.isnull().sum()
# Print missing values count
if missing_values.sum() == 0:
print("No missing values in the dataset.")
else:
print("Missing values in each column:\n", missing_values)
No missing values in the dataset.
Outlier check
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Outlier detection using box plots
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
plt.figure(figsize=(15,10))
for i, col in enumerate(numeric_columns, 1):
plt.subplot(4, 3, i)
sns.boxplot(x=df[col])
plt.title(f'Box plot of {col}')
plt.tight_layout()
plt.show()
Observation: outliers are present in most of the numeric features, as seen in the box plots above.
Handling Outliers:
Outliers can heavily affect clustering algorithms, especially K-Means, which uses Euclidean distance, making it sensitive to extreme values.
Key Considerations: outliers can be removed, transformed, or kept, depending on whether they represent noise or genuine signal.
Options for Handling Outliers:
A. Removing Outliers: drop observations beyond a chosen threshold. Simple, but discards data that may carry real information.
B. Transforming Outliers: cap extreme values at chosen percentiles (Winsorizing) or apply a transformation such as log scaling to reduce their influence.
C. Keeping Outliers: keep outliers if they contain useful information. For example, extreme stock prices or returns might represent critical financial trends.
Our Approach: we choose Winsorizing because it is well suited to financial data, reducing the influence of extreme values without discarding potentially useful information.
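K-Means' sensitivity to extreme values can be seen directly on synthetic data; a minimal sketch (toy 1-D points, not the stock dataset) in which a single extreme value absorbs one of the two centroids and forces the two real groups to merge:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight groups of three points each; k = 2 recovers them cleanly
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
X_out = np.vstack([X, [[1000.0]]])  # add a single extreme outlier

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
km_out = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_out)

print(sorted(km.cluster_centers_.ravel()))      # centroids near 0.1 and 10.1
print(sorted(km_out.cluster_centers_.ravel()))  # outlier claims a centroid; real groups merge
```

Because inertia is a sum of squared Euclidean distances, isolating the far point in its own cluster minimizes the objective, collapsing the two genuine groups into one — exactly the failure mode Winsorizing guards against.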
Winsorizing Of Outliers
import pandas as pd
import numpy as np
from scipy.stats import mstats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# List of numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Define the percentile range for Winsorization
lower_percentile = 5
upper_percentile = 95
# Apply Winsorization to each numeric column
for col in numeric_columns:
# Calculate lower and upper percentiles
lower_limit = np.percentile(df[col], lower_percentile)
upper_limit = np.percentile(df[col], upper_percentile)
# Apply Winsorization
df[col] = np.clip(df[col], lower_limit, upper_limit)
# Display the first few rows of the Winsorized data
print(df.head())
# Plot boxplots to visualize the effect of Winsorization
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_columns, 1):
plt.subplot(3, 4, i)
sns.boxplot(x=df[col])
plt.title(col)
plt.tight_layout()
plt.show()
Ticker Symbol Security GICS Sector \
0 AAL American Airlines Group Industrials
1 ABBV AbbVie Health Care
2 ABT Abbott Laboratories Health Care
3 ADBE Adobe Systems Inc Information Technology
4 ADI Analog Devices, Inc. Information Technology
GICS Sub Industry Current Price Price Change Volatility ROE \
0 Airlines 42.349998 9.999995 1.687151 135.0
1 Pharmaceuticals 59.240002 8.339433 2.197887 130.0
2 Health Care Equipment 44.910000 11.301121 1.273646 21.0
3 Application Software 93.940002 13.977195 1.357679 9.0
4 Semiconductors 55.320000 -1.827858 1.701169 14.0
Cash Ratio Net Cash Flow Net Income Earnings Per Share \
0 51.00 -604000000.0 7.270650e+09 9.711
1 77.00 51000000.0 5.144000e+09 3.150
2 67.00 938000000.0 4.423000e+09 2.940
3 180.00 -240840000.0 6.295510e+08 1.260
4 201.55 315120000.0 6.968780e+08 0.310
Estimated Shares Outstanding P/E Ratio P/B Ratio
0 6.681299e+08 9.879286 -8.784219
1 1.633016e+09 18.806350 -8.750068
2 1.504422e+09 15.275510 -0.394171
3 4.996437e+08 74.555557 4.199651
4 1.701898e+09 93.089287 1.059810
Observation: after Winsorizing, far fewer outliers appear in the box plots.
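The cell above imports scipy.stats.mstats but caps values with np.percentile and np.clip; mstats.winsorize does the same job in one call. A sketch on synthetic data (results differ slightly from the clip approach, since winsorize uses order statistics rather than interpolated percentiles):

```python
import numpy as np
from scipy.stats import mstats

rng = np.random.default_rng(0)
a = rng.normal(size=1000)

# scipy's winsorize caps the lowest and highest 5% in one call
w = np.asarray(mstats.winsorize(a, limits=[0.05, 0.05]))

# np.clip to the 5th/95th percentiles, as done in the cell above
c = np.clip(a, np.percentile(a, 5), np.percentile(a, 95))

print(w.min(), w.max())  # extremes are capped relative to the raw data
```

Either approach is fine here; np.clip makes the limits explicit, while mstats.winsorize expresses intent more directly.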
Feature engineering (if needed)
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
Distribution after Scaling the features
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# List of numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Define the percentile range for Winsorization
lower_percentile = 5
upper_percentile = 95
# Apply Winsorization to each numeric column
for col in numeric_columns:
# Calculate lower and upper percentiles
lower_limit = np.percentile(df[col], lower_percentile)
upper_limit = np.percentile(df[col], upper_percentile)
# Apply Winsorization
df[col] = np.clip(df[col], lower_limit, upper_limit)
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Plot distributions before and after scaling
num_features = len(numeric_columns)
fig, axes = plt.subplots(nrows=num_features, ncols=2, figsize=(15, 2 * num_features))
for i, col in enumerate(numeric_columns):
# Plot before scaling
sns.histplot(df[col], kde=True, color='blue', bins=30, ax=axes[i, 0])
axes[i, 0].set_title(f'Before Scaling\n{col}')
# Plot after scaling
sns.histplot(df_scaled[col], kde=True, color='green', bins=30, ax=axes[i, 1])
axes[i, 1].set_title(f'After Scaling\n{col}')
plt.tight_layout()
plt.suptitle('Distribution Before and After Scaling', y=1.02)
plt.show()
Observations:
After applying StandardScaler to the dataset, I observed that the distributions of each feature look almost identical to the original ones, but with the data now centered around zero with a standard deviation of one. This scaling ensures that no single feature dominates the clustering process due to differences in scale, making the dataset more balanced.
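The claim that scaling centers each feature at zero with unit standard deviation is easy to verify on toy data; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # toy features on a large scale

Xs = StandardScaler().fit_transform(X)

# Each column is centered at 0 with unit variance; the shape is preserved
print(Xs.mean(axis=0).round(6), Xs.std(axis=0).round(6))
```

StandardScaler only shifts and rescales each column, which is why the histograms keep their shape while the axes change.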
Box plots of the numeric columns after Winsorizing outliers and scaling
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# List of numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Define the percentile range for Winsorization
lower_percentile = 5
upper_percentile = 95
# Save the original data for comparison
df_original = df.copy()
# Apply Winsorization to each numeric column
for col in numeric_columns:
# Calculate lower and upper percentiles
lower_limit = np.percentile(df[col], lower_percentile)
upper_limit = np.percentile(df[col], upper_percentile)
# Apply Winsorization
df[col] = np.clip(df[col], lower_limit, upper_limit)
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Plot boxplots before and after Winsorization and Scaling
fig, axes = plt.subplots(nrows=len(numeric_columns), ncols=2, figsize=(15, 2 * len(numeric_columns)))
for i, col in enumerate(numeric_columns):
# Boxplot before Winsorization and Scaling
sns.boxplot(x=df_original[col], color='blue', ax=axes[i, 0])
axes[i, 0].set_title(f'Before Winsorization\n{col}')
# Boxplot after Winsorization and Scaling
sns.boxplot(x=df_scaled[col], color='green', ax=axes[i, 1])
axes[i, 1].set_title(f'After Winsorization and Scaling\n{col}')
plt.tight_layout()
plt.suptitle('Box Plots Before and After Winsorization and Standardization', y=1.02)
plt.show()
Observation
The Winsorization process capped only a small fraction of values in each column. With the lower and upper limits set at the 5th and 95th percentiles, extreme values are controlled while the overall distribution and variability of the data are retained.
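By construction, capping at the 5th and 95th percentiles touches roughly 10% of the observations in each column; a quick sketch on synthetic data confirming the count:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)  # toy column with no ties

lo, hi = np.percentile(x, 5), np.percentile(x, 95)
capped = np.count_nonzero((x < lo) | (x > hi))

# About 5% of values fall below lo and 5% above hi
print(capped, len(x))
```

On real stock data with heavy tails, the same ~10% of observations absorbs the most extreme values, which is why only a few points remain flagged in the box plots.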
K-means Clustering: Elbow Curve
The KMedoids algorithm from the scikit-learn-extra library, which supports Manhattan distance, is used below.
# Install scikit-learn-extra for KMedoids
!pip install scikit-learn-extra -q
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn_extra.cluster import KMedoids
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# List of numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Define the percentile range for Winsorization
lower_percentile = 5
upper_percentile = 95
# Apply Winsorization to each numeric column
for col in numeric_columns:
# Calculate lower and upper percentiles
lower_limit = np.percentile(df[col], lower_percentile)
upper_limit = np.percentile(df[col], upper_percentile)
# Apply Winsorization
df[col] = np.clip(df[col], lower_limit, upper_limit)
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Function to plot Elbow Curve for KMedoids using Manhattan distance
def plot_elbow_curve(data, title, subplot):
inertia = []
k_range = range(2, 11)
for k in k_range:
kmedoids = KMedoids(n_clusters=k, metric='manhattan', random_state=42)
kmedoids.fit(data)
inertia.append(kmedoids.inertia_)
subplot.plot(k_range, inertia, marker='o')
subplot.set_title(title)
subplot.set_xlabel('Number of Clusters')
subplot.set_ylabel('Inertia')
subplot.grid(True)
# Create subplots
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(8, 6))
# Plot Elbow Curve for Winsorized and scaled data using Manhattan distance
plot_elbow_curve(df_scaled, 'Elbow Curve (Manhattan Distance)', axs)
plt.tight_layout()
plt.show()
Explanation
The code uses the KMedoids algorithm from the scikit-learn-extra library, which supports Manhattan distance; the metric='manhattan' argument selects it.
Observation: we observe elbow points at k = 3 and k = 4, making these good candidates for the number of clusters.
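For reference, the two metrics compare as follows: Manhattan distance sums absolute coordinate differences, while Euclidean squares them, so Euclidean penalizes a single extreme coordinate much more heavily. A minimal sketch:

```python
import numpy as np

p, q = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])

manhattan = np.abs(p - q).sum()            # |1-4| + |2-6| + |3-3| = 7
euclidean = np.sqrt(((p - q) ** 2).sum())  # sqrt(9 + 16 + 0) = 5
print(manhattan, euclidean)
```

This is why Manhattan-based medoid clustering tends to be more robust to residual outliers than Euclidean K-Means.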
Euclidean distance KMeans
The KMeans implementation from scikit-learn is used here, as it defaults to Euclidean distance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# List of numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Define the percentile range for Winsorization
lower_percentile = 5
upper_percentile = 95
# Save the original data for comparison
df_original = df.copy()
# Apply Winsorization to each numeric column
for col in numeric_columns:
# Calculate lower and upper percentiles
lower_limit = np.percentile(df[col], lower_percentile)
upper_limit = np.percentile(df[col], upper_percentile)
# Apply Winsorization
df[col] = np.clip(df[col], lower_limit, upper_limit)
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Function to plot Elbow Curve
def plot_elbow_curve(data, title, subplot):
inertia = []
k_range = range(2, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(data)
inertia.append(kmeans.inertia_)
subplot.plot(k_range, inertia, marker='o')
subplot.set_title(title)
subplot.set_xlabel('Number of Clusters')
subplot.set_ylabel('Inertia')
subplot.grid(True)
# Create subplots
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
# Plot Elbow Curve for original data
plot_elbow_curve(df[numeric_columns], 'Elbow Curve (Original Data)', axs[0])
# Plot Elbow Curve for Winsorized and scaled data
plot_elbow_curve(df_scaled, 'Elbow Curve (Winsorized & Scaled Data)', axs[1])
plt.tight_layout()
plt.show()
Observation
Elbow Curve Analysis:
In the Elbow Curve for the original data, I observed distinct elbows at three points: 3, 4, and 7 clusters. These noticeable bends indicate significant reductions in within-cluster variance at these values of ( k ). The presence of multiple elbow points suggests that the data may contain several natural groupings or structures. The variability and potential noise in the original data could be contributing to the appearance of multiple plausible cluster numbers.
For the Winsorized and scaled data, the Elbow Curve appears smooth without any clear, distinct elbows. This smoothness indicates that the preprocessing has reduced the impact of outliers and standardized feature scales, leading to a more stable dataset. As a result, the Elbow Curve is less pronounced, making it difficult to identify an optimal number of clusters visually.
Given these observations, we turn to silhouette scores next to confirm the choice of k.
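One way to make the elbow reading less subjective is to locate the bend programmatically as the point of sharpest slowdown in the inertia curve, i.e. the largest second difference. A sketch on hypothetical inertia values (the numbers below are illustrative, not the notebook's actual output):

```python
import numpy as np

# Hypothetical inertia values for k = 2..10 (the shape matters, not the numbers)
k_range = np.arange(2, 11)
inertia = np.array([900, 620, 480, 430, 400, 380, 365, 355, 348], dtype=float)

# The elbow is where the decrease in inertia slows most sharply:
# the k with the largest second difference of the curve
second_diff = np.diff(inertia, 2)          # aligned with k = 3..9
elbow_k = int(k_range[1:-1][np.argmax(second_diff)])
print(elbow_k)
```

This heuristic is a rough guide only; for a smooth curve like the Winsorized one, silhouette scores remain the more reliable criterion.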
K-means Clustering: Silhouette Scores
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# List of numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Define the percentile range for Winsorization
lower_percentile = 5
upper_percentile = 95
# Save the original data for comparison
df_original = df.copy()
# Apply Winsorization to each numeric column
for col in numeric_columns:
# Calculate lower and upper percentiles
lower_limit = np.percentile(df[col], lower_percentile)
upper_limit = np.percentile(df[col], upper_percentile)
# Apply Winsorization
df[col] = np.clip(df[col], lower_limit, upper_limit)
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Function to compute silhouette scores
def compute_silhouette_scores(data):
silhouette_scores = []
k_range = range(2, 11)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(data)
score = silhouette_score(data, labels)
silhouette_scores.append(score)
return silhouette_scores
# Create subplots
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
# Plot histograms of the original data
for col in numeric_columns:
axs[0, 0].hist(df_original[col], bins=30, alpha=0.7, label=col)
axs[0, 0].set_title('Original Data Distribution')
axs[0, 0].set_xlabel('Value')
axs[0, 0].set_ylabel('Frequency')
axs[0, 0].legend(loc='upper right')
# Plot silhouette scores for the original data
axs[0, 1].plot(range(2, 11), compute_silhouette_scores(df_original[numeric_columns]), marker='o')
axs[0, 1].set_title('Silhouette Scores (Original Data)')
axs[0, 1].set_xlabel('Number of Clusters')
axs[0, 1].set_ylabel('Silhouette Score')
axs[0, 1].grid(True)
# Plot histograms of the Winsorized and scaled data
for col in numeric_columns:
axs[1, 0].hist(df_scaled[col], bins=30, alpha=0.7, label=col)
axs[1, 0].set_title('Winsorized and Scaled Data Distribution')
axs[1, 0].set_xlabel('Value')
axs[1, 0].set_ylabel('Frequency')
axs[1, 0].legend(loc='upper right')
# Plot silhouette scores for the Winsorized and scaled data
axs[1, 1].plot(range(2, 11), compute_silhouette_scores(df_scaled), marker='o')
axs[1, 1].set_title('Silhouette Scores (Winsorized & Scaled Data)')
axs[1, 1].set_xlabel('Number of Clusters')
axs[1, 1].set_ylabel('Silhouette Score')
axs[1, 1].grid(True)
plt.tight_layout()
plt.show()
Observation on Silhouette Scores and Data Distribution
Silhouette Scores Analysis
The analysis of the silhouette scores reveals distinct trends in clustering performance for both the original and Winsorized & scaled data:
Original Data: The silhouette scores indicate a notable bend in the graph at 3 clusters, with additional bends observed at 4, 5, 6, and 7 clusters. This suggests that the optimal number of clusters might be around 3 to 7, where the silhouette scores are relatively higher.
Winsorized and Scaled Data: for the Winsorized and scaled data, the first significant bend in the silhouette scores is observed at 4 clusters, followed by smaller bends at 5 through 8 clusters, suggesting k = 4 as a strong candidate.
Preferred Number of Clusters
Based on the silhouette scores, we recommend using 4 clusters for the following reasons:
Elbow Method Concurrence: The chosen number of clusters (k=4) aligns well with the elbow method, which suggests that adding more clusters beyond 4 yields diminishing returns in clustering quality. The elbow method is a widely accepted approach to determining the optimal number of clusters by identifying a point where adding additional clusters does not significantly improve the model.
Clustering Quality: The silhouette score for 4 clusters shows a strong and consistent performance compared to other values. This score indicates that data points within each cluster are more similar to each other than to points in other clusters, making 4 clusters a well-balanced choice between complexity and clustering quality.
In summary, selecting 4 clusters is supported by both the silhouette score analysis and the elbow method, providing a robust basis for clustering this dataset.
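As a sanity check, the silhouette criterion recovers the true number of groups when clusters are well separated; a sketch on toy blobs (synthetic data, not the stock dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic groups (illustration only)
centers = [[0, 0], [8, 0], [0, 8], [8, 8]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.7, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the score peaks at the true number of groups
```

On real financial data the peak is less pronounced, which is why we combine the silhouette criterion with the elbow method rather than relying on either alone.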
Cluster Profiling before applying Winsorization
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset and scale the raw (non-Winsorized) features
df = pd.read_csv('/content/stock_data.csv')
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Apply K-means with the optimal number of clusters
optimal_k = 4  # Based on elbow and silhouette results
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df_scaled['Cluster'] = kmeans.fit_predict(df_scaled[numeric_columns])
# Cluster profiling by averaging feature values in each cluster
cluster_profile = df_scaled.groupby('Cluster').mean()
display(cluster_profile)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | |||||||||||
| 0 | -0.436756 | -1.622242 | 2.342968 | 0.699929 | -0.221363 | -0.110608 | -1.367911 | -1.862300 | -0.114408 | 1.309894 | 0.219435 |
| 1 | 1.658290 | 0.794847 | 0.367191 | -0.136718 | 2.381924 | 0.359666 | -0.131724 | 0.508698 | -0.011635 | 1.014595 | 1.201739 |
| 2 | -0.088480 | 0.083426 | -0.232767 | -0.052685 | -0.186714 | 0.002654 | 0.010077 | 0.128666 | -0.162244 | -0.198898 | -0.117019 |
| 3 | -0.309926 | 0.139248 | -0.669418 | -0.088233 | 0.065186 | -0.580298 | 3.390319 | 0.209466 | 4.406556 | -0.402159 | -0.203198 |
After performing K-Means clustering on the dataset with four clusters, the following observations can be made based on the average values of key financial indicators for each cluster:
Cluster 0: stocks that have experienced significant price drops and high volatility. Despite high ROE, these companies face financial challenges with negative earnings. The high P/E ratio suggests possible overvaluation, making these speculative investments.
Cluster 1: stocks with higher prices, stable growth, good liquidity (high cash ratio), and above-average P/B ratios. These may be attractive for investors looking for relatively stable and financially sound companies.
Cluster 2: stocks with values close to the overall average, with slightly lower prices and below-average profitability indicators; a broad middle group that warrants stock-by-stock scrutiny.
Cluster 3: companies with very high net income and share counts, typical of large-cap firms, but below-average P/E and P/B ratios, warranting a closer look at valuation before investing.
Scatter plots for the selected features from the cluster profiling data:
Before applying Winsorization
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Example Data based on the cluster profile
data = {
'Cluster': [0, 1, 2, 3],
'Current Price': [-0.436756, 1.658290, -0.088480, -0.309926],
'Price Change': [-1.622242, 0.794847, 0.083426, 0.139248],
'Volatility': [2.342968, 0.367191, -0.232767, -0.669418],
'ROE': [0.699929, -0.136718, -0.052685, -0.088233],
'Cash Ratio': [-0.221363, 2.381924, -0.186714, 0.065186],
'Net Cash Flow': [-0.110608, 0.359666, 0.002654, -0.580298],
'Net Income': [-1.367911, -0.131724, 0.010077, 3.390319],
'Earnings Per Share': [-1.862300, 0.508698, 0.128666, 0.209466],
'Estimated Shares Outstanding': [-0.114408, -0.011635, -0.162244, 4.406556],
'P/E Ratio': [1.309894, 1.014595, -0.198898, -0.402159],
'P/B Ratio': [0.219435, 1.201739, -0.117019, -0.203198]
}
# Creating a DataFrame
df_cluster_profile = pd.DataFrame(data)
# Plotting scatter plots of selected features
plt.figure(figsize=(14, 10))
# Scatter plot for 'Current Price' vs 'Price Change'
plt.subplot(2, 2, 1)
sns.scatterplot(x='Current Price', y='Price Change', hue='Cluster', data=df_cluster_profile, palette='Set1')
plt.title('Current Price vs Price Change')
# Scatter plot for 'ROE' vs 'Net Income'
plt.subplot(2, 2, 2)
sns.scatterplot(x='ROE', y='Net Income', hue='Cluster', data=df_cluster_profile, palette='Set2')
plt.title('ROE vs Net Income')
# Scatter plot for 'Volatility' vs 'Earnings Per Share'
plt.subplot(2, 2, 3)
sns.scatterplot(x='Volatility', y='Earnings Per Share', hue='Cluster', data=df_cluster_profile, palette='Set3')
plt.title('Volatility vs Earnings Per Share')
# Scatter plot for 'P/E Ratio' vs 'P/B Ratio'
plt.subplot(2, 2, 4)
sns.scatterplot(x='P/E Ratio', y='P/B Ratio', hue='Cluster', data=df_cluster_profile, palette='Set1')
plt.title('P/E Ratio vs P/B Ratio')
plt.tight_layout()
plt.show()
Observation
These insights can be used to guide investors in portfolio diversification, identifying which clusters to invest in based on risk tolerance and return expectations.
Analysis of the GICS Sector distribution across clusters before applying Winsorization
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Define the numeric columns (you can adjust this based on your dataset)
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE',
'Cash Ratio', 'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio']
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Add the 'GICS Sector' column to df_scaled for sector analysis
df_scaled['GICS Sector'] = df['GICS Sector']
# Apply K-means with the optimal number of clusters (e.g., 4)
from sklearn.cluster import KMeans
optimal_k = 4 # Based on previous analysis
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df_scaled['Cluster'] = kmeans.fit_predict(df_scaled[numeric_columns])
# Cluster profiling by averaging only the numeric feature values in each cluster
numeric_cluster_profile = df_scaled.groupby('Cluster')[numeric_columns].mean()
print("Cluster Profiling (Numeric Features):\n", numeric_cluster_profile)
# Cross-tabulation of GICS Sector and Cluster
sector_distribution = pd.crosstab(df_scaled['GICS Sector'], df_scaled['Cluster'])
# Plotting the sector distribution in each cluster
sector_distribution.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Distribution of GICS Sector Across Clusters')
plt.ylabel('Number of Companies')
plt.xlabel('GICS Sector')
plt.xticks(rotation=45)
plt.show()
Cluster Profiling (Numeric Features):
Current Price Price Change Volatility ROE Cash Ratio \
Cluster
0 -0.436756 -1.622242 2.342968 0.699929 -0.221363
1 1.658290 0.794847 0.367191 -0.136718 2.381924
2 -0.088480 0.083426 -0.232767 -0.052685 -0.186714
3 -0.309926 0.139248 -0.669418 -0.088233 0.065186
Net Cash Flow Net Income Earnings Per Share \
Cluster
0 -0.110608 -1.367911 -1.862300
1 0.359666 -0.131724 0.508698
2 0.002654 0.010077 0.128666
3 -0.580298 3.390319 0.209466
Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster
0 -0.114408 1.309894 0.219435
1 -0.011635 1.014595 1.201739
2 -0.162244 -0.198898 -0.117019
3 4.406556 -0.402159 -0.203198
Observation
Cluster 0: marked by sharp price declines and very high volatility, with below-average earnings and an elevated P/E ratio; the riskiest profile.
Cluster 1: high-priced, highly liquid stocks (very high cash ratio) with above-average P/B ratios.
Cluster 2: the broad, average group, with all indicators close to the overall mean.
Cluster 3: companies with very high net income and shares outstanding, characteristic of large-cap firms.
These observations provide a detailed understanding of how each cluster is represented by different sectors and their associated risk profiles, helping investors make informed decisions based on their investment strategy.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# List of numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Define the percentile range for Winsorization
lower_percentile = 5
upper_percentile = 95
# Save the original data for comparison
df_original = df.copy()
# Apply Winsorization to each numeric column
for col in numeric_columns:
# Calculate lower and upper percentiles
lower_limit = np.percentile(df[col], lower_percentile)
upper_limit = np.percentile(df[col], upper_percentile)
# Apply Winsorization
df[col] = np.clip(df[col], lower_limit, upper_limit)
# Scaling the features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
# Apply K-means with the optimal number of clusters
optimal_k = 4 # Based on elbow and silhouette results
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df_scaled['Cluster'] = kmeans.fit_predict(df_scaled)
# Cluster profiling by averaging feature values in each cluster
cluster_profile = df_scaled.groupby('Cluster').mean()
display(cluster_profile)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster | |||||||||||
| 0 | 1.463051 | 0.402817 | -0.105891 | 0.120887 | 0.433868 | -0.006664 | 0.032674 | 1.080188 | -0.502194 | 0.053249 | 0.329847 |
| 1 | -0.161363 | 0.319602 | -0.424824 | 0.215796 | 0.321018 | 0.274997 | 1.858444 | 0.378249 | 1.826490 | -0.383427 | -0.197792 |
| 2 | -0.321048 | -0.003481 | -0.245735 | -0.189374 | -0.284924 | -0.043623 | -0.319839 | -0.172800 | -0.334743 | -0.248716 | -0.169503 |
| 3 | -0.706172 | -1.083927 | 1.944672 | 0.451860 | 0.253538 | -0.127385 | -0.869945 | -1.480528 | 0.159900 | 1.633827 | 0.529428 |
Scatter plots for the selected features from the cluster profiling data:
After applying Winsorization and scaling
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Updated Data based on the cluster profile
data = {
'Cluster': [0, 1, 2, 3],
'Current Price': [1.463051, -0.161363, -0.321048, -0.706172],
'Price Change': [0.402817, 0.319602, -0.003481, -1.083927],
'Volatility': [-0.105891, -0.424824, -0.245735, 1.944672],
'ROE': [0.120887, 0.215796, -0.189374, 0.451860],
'Cash Ratio': [0.433868, 0.321018, -0.284924, 0.253538],
'Net Cash Flow': [-0.006664, 0.274997, -0.043623, -0.127385],
'Net Income': [0.032674, 1.858444, -0.319839, -0.869945],
'Earnings Per Share': [1.080188, 0.378249, -0.172800, -1.480528],
'Estimated Shares Outstanding': [-0.502194, 1.826490, -0.334743, 0.159900],
'P/E Ratio': [0.053249, -0.383427, -0.248716, 1.633827],
'P/B Ratio': [0.329847, -0.197792, -0.169503, 0.529428]
}
# Creating a DataFrame
df_cluster_profile = pd.DataFrame(data)
# Plotting scatter plots of selected features
plt.figure(figsize=(14, 10))
# Scatter plot for 'Current Price' vs 'Price Change'
plt.subplot(2, 2, 1)
sns.scatterplot(x='Current Price', y='Price Change', hue='Cluster', data=df_cluster_profile, palette='Set1')
plt.title('Current Price vs Price Change')
# Scatter plot for 'ROE' vs 'Net Income'
plt.subplot(2, 2, 2)
sns.scatterplot(x='ROE', y='Net Income', hue='Cluster', data=df_cluster_profile, palette='Set2')
plt.title('ROE vs Net Income')
# Scatter plot for 'Volatility' vs 'Earnings Per Share'
plt.subplot(2, 2, 3)
sns.scatterplot(x='Volatility', y='Earnings Per Share', hue='Cluster', data=df_cluster_profile, palette='Set3')
plt.title('Volatility vs Earnings Per Share')
# Scatter plot for 'P/E Ratio' vs 'P/B Ratio'
plt.subplot(2, 2, 4)
sns.scatterplot(x='P/E Ratio', y='P/B Ratio', hue='Cluster', data=df_cluster_profile, palette='Set1')
plt.title('P/E Ratio vs P/B Ratio')
plt.tight_layout()
plt.show()
We analyze the changes in feature values for each cluster; a summary of the comparison is presented below. Note that K-means cluster labels are arbitrary, so clusters should be matched by profile rather than by number.
Before Winsorization:
| Cluster | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.437 | -1.622 | 2.343 | 0.700 | -0.221 | -0.111 | -1.368 | -1.862 | -0.114 | 1.310 | 0.219 |
| 1 | 1.658 | 0.795 | 0.367 | -0.137 | 2.382 | 0.360 | -0.132 | 0.509 | -0.012 | 1.015 | 1.202 |
| 2 | -0.088 | 0.083 | -0.233 | -0.053 | -0.187 | 0.003 | 0.010 | 0.129 | -0.162 | -0.199 | -0.117 |
| 3 | -0.310 | 0.139 | -0.669 | -0.088 | 0.065 | -0.580 | 3.390 | 0.209 | 4.407 | -0.402 | -0.203 |
After Winsorization and scaling:
| Cluster | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.463 | 0.403 | -0.106 | 0.121 | 0.434 | -0.007 | 0.033 | 1.080 | -0.502 | 0.053 | 0.330 |
| 1 | -0.161 | 0.320 | -0.425 | 0.216 | 0.321 | 0.275 | 1.858 | 0.378 | 1.826 | -0.383 | -0.198 |
| 2 | -0.321 | -0.003 | -0.246 | -0.189 | -0.285 | -0.044 | -0.320 | -0.173 | -0.335 | -0.249 | -0.170 |
| 3 | -0.706 | -1.084 | 1.945 | 0.452 | 0.254 | -0.127 | -0.870 | -1.481 | 0.160 | 1.634 | 0.529 |
Current Price: the highest cluster mean drops from 1.658 to 1.463.
Price Change: the most negative mean moves from -1.622 to -1.084.
Volatility: the peak mean falls from 2.343 to 1.945.
ROE: the top mean falls from 0.700 to 0.452.
Cash Ratio: the extreme of 2.382 is reduced to 0.434.
Net Cash Flow: already small in magnitude; the most negative mean moves from -0.580 to -0.127.
Net Income: the extreme of 3.390 falls to 1.858.
Earnings Per Share: the most negative mean moves from -1.862 to -1.481.
Estimated Shares Outstanding: the extreme of 4.407 falls to 1.826.
P/E Ratio: the top mean shifts from 1.310 to 1.634.
P/B Ratio: the spread narrows, with the top mean falling from 1.202 to 0.529.
Summary: Winsorization and scaling have led to a normalization of feature values across clusters, reducing extreme values and making the data more comparable. This process helps in improving the consistency and comparability of cluster profiles, potentially leading to more reliable clustering results.
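One way to quantify how much Winsorization changes the cluster assignments themselves is the adjusted Rand index (ARI) between the two labelings; a sketch on synthetic data with injected outliers (not the stock dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
X[:10] *= 25  # inject heavy outliers into the first 10 rows

# Winsorize each column at its 5th/95th percentiles, as in the notebook
Xw = np.clip(X, np.percentile(X, 5, axis=0), np.percentile(X, 95, axis=0))

labels_raw = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(X))
labels_win = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(Xw))

# ARI of 1 means identical groupings; lower values show Winsorization changed them
ari = adjusted_rand_score(labels_raw, labels_win)
print(ari)
```

ARI is invariant to label permutation, so it compares the groupings directly even though K-means numbers its clusters arbitrarily.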
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Define the numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE',
'Cash Ratio', 'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio']
# Apply Winsorization to each numeric column
lower_percentile = 5
upper_percentile = 95
df_winsorized = df.copy()
for col in numeric_columns:
lower_limit = np.percentile(df_winsorized[col], lower_percentile)
upper_limit = np.percentile(df_winsorized[col], upper_percentile)
df_winsorized[col] = np.clip(df_winsorized[col], lower_limit, upper_limit)
# Scaling the features for Winsorized data
scaler = StandardScaler()
df_winsorized_scaled = pd.DataFrame(scaler.fit_transform(df_winsorized[numeric_columns]), columns=numeric_columns)
df_winsorized_scaled['GICS Sector'] = df_winsorized['GICS Sector']
# Apply K-means clustering on the original and Winsorized/scaled data
optimal_k = 4
# Note: the original data is clustered unscaled here, so large-magnitude features dominate the distances
kmeans_original = KMeans(n_clusters=optimal_k, random_state=42)
df['Cluster'] = kmeans_original.fit_predict(df[numeric_columns])
kmeans_winsorized = KMeans(n_clusters=optimal_k, random_state=42)
df_winsorized_scaled['Cluster'] = kmeans_winsorized.fit_predict(df_winsorized_scaled[numeric_columns])
# Create cross-tabulations for GICS Sector and Cluster
sector_distribution_original = pd.crosstab(df['GICS Sector'], df['Cluster'])
sector_distribution_winsorized = pd.crosstab(df_winsorized_scaled['GICS Sector'], df_winsorized_scaled['Cluster'])
# Plotting the sector distribution in each cluster
plt.figure(figsize=(16, 8))
# Subplot for distribution before Winsorization
plt.subplot(1, 2, 1)
sector_distribution_original.plot(kind='bar', stacked=True, ax=plt.gca(), cmap='viridis')
plt.title('Distribution of GICS Sector Across Clusters (Original)')
plt.ylabel('Number of Companies')
plt.xlabel('GICS Sector')
plt.xticks(rotation=45)
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
# Subplot for distribution after Winsorization
plt.subplot(1, 2, 2)
sector_distribution_winsorized.plot(kind='bar', stacked=True, ax=plt.gca(), cmap='viridis')
plt.title('Distribution of GICS Sector Across Clusters (Winsorized)')
plt.ylabel('Number of Companies')
plt.xlabel('GICS Sector')
plt.xticks(rotation=45)
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Observation
Based on the comparison of results before and after Winsorization, it is clear that Winsorization significantly alters the cluster profiling, especially in a dataset related to stock performance. Since stock data inherently contains high volatility and extreme values, capping these outliers using Winsorization may lead to a loss of critical information that is important for understanding market behaviors and stock performance variations.
To get a better understanding, I will create a side-by-side bar plot (cluster profile comparison plot) to compare the cluster profiles before and after Winsorization.
Side-by-Side Bar Plot / Cluster Profile Comparison Plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Define the numeric columns
numeric_columns = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']
# Function to apply K-Means clustering
def apply_kmeans(df, n_clusters=4):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    df['Cluster'] = kmeans.fit_predict(df[numeric_columns])
    return df
# Scaling the features
scaler = StandardScaler()
# Cluster profiling before Winsorization
df_scaled_before = pd.DataFrame(scaler.fit_transform(df[numeric_columns]), columns=numeric_columns)
df_scaled_before['GICS Sector'] = df['GICS Sector']
df_scaled_before = apply_kmeans(df_scaled_before)
# Winsorization
def winsorize(df, columns, lower_percentile=5, upper_percentile=95):
    df_winsorized = df.copy()
    for col in columns:
        lower_limit = np.percentile(df[col], lower_percentile)
        upper_limit = np.percentile(df[col], upper_percentile)
        df_winsorized[col] = np.clip(df[col], lower_limit, upper_limit)
    return df_winsorized
df_winsorized = winsorize(df, numeric_columns)
df_scaled_after = pd.DataFrame(scaler.fit_transform(df_winsorized[numeric_columns]), columns=numeric_columns)
df_scaled_after['GICS Sector'] = df_winsorized['GICS Sector']
df_scaled_after = apply_kmeans(df_scaled_after)
# Cluster profiling
def cluster_profiling(df):
    return df.groupby('Cluster')[numeric_columns].mean()
profile_before = cluster_profiling(df_scaled_before)
profile_after = cluster_profiling(df_scaled_after)
# Comparing cluster profiles before and after Winsorization
def plot_cluster_profiles(profile_before, profile_after):
    num_features = len(numeric_columns)
    fig, axes = plt.subplots(num_features, 2, figsize=(14, 2*num_features), sharey=True)
    fig.suptitle('Cluster Profiling Before and After Winsorization', fontsize=16)
    for i, feature in enumerate(numeric_columns):
        axes[i, 0].bar(profile_before.index, profile_before[feature], color='skyblue')
        axes[i, 0].set_title(f'Before Winsorization\n{feature}')
        axes[i, 0].set_ylabel('Value')
        axes[i, 1].bar(profile_after.index, profile_after[feature], color='lightcoral')
        axes[i, 1].set_title(f'After Winsorization\n{feature}')
        for ax in axes[i, :]:
            ax.set_xlabel('Cluster')
            ax.grid(True)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
# Plot results
plot_cluster_profiles(profile_before, profile_after)
Stock Data Characteristics:
Changes in Cluster Profiling:
Impact on Sector Distribution:
Loss of Important Information:
Given the importance of extreme values in stock market data, proceeding with clustering without Winsorization is more appropriate for the following reasons:
Retaining Important Market Insights:
Understanding Risk and Volatility:
Better Sector Differentiation:
Original Data: The GICS Sector distribution before Winsorization shows clear sector preferences across different clusters, with specific sectors being more dominant in certain clusters. This helps in understanding how different industries are performing and reacting to the market, which is crucial for targeted investment strategies.
Winsorized Data: Post-Winsorization, the sector distribution becomes more evenly spread across clusters, reducing the ability to identify sector-based trends. This uniformity in distribution can hide important sector-specific information, limiting insights that are critical for sector-based decision-making.
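To make this comparison concrete, how evenly sectors spread across clusters can be quantified with a normalized cross-tabulation. A minimal sketch, using hypothetical sector and cluster labels rather than the real assignments:

```python
import pandas as pd

# Hypothetical sector/cluster assignments for illustration
toy = pd.DataFrame({
    'GICS Sector': ['Energy', 'Energy', 'IT', 'IT', 'Health', 'Health'],
    'Cluster':     [0, 0, 1, 1, 0, 1],
})

# Share of each cluster accounted for by each sector (each column sums to 1)
shares = pd.crosstab(toy['GICS Sector'], toy['Cluster'], normalize='columns')
print(shares)
```

A cluster dominated by one sector shows a share close to 1 in its column, while a post-Winsorization "evened out" distribution shows shares close to uniform.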
Given the high relevance of outliers in stock data and the impact that Winsorization has on reducing valuable sector- and risk-related information, it is advisable to proceed with clustering without Winsorization. By doing so, we maintain the integrity of the data, ensuring that extreme but meaningful values are included in the analysis, which can provide deeper insights into stock performance, sector trends, and risk assessment.
This approach will lead to more accurate clustering results, especially for decision-making in stock market analysis, where understanding volatility, sector dynamics, and outliers is key.
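One way to support this decision quantitatively is to measure how many observations Winsorization actually alters. A small sketch, assuming the same 5th/95th percentile limits used above and a synthetic heavy-tailed sample standing in for the real features:

```python
import numpy as np

def clipped_fraction(values, lower_pct=5, upper_pct=95):
    """Return the share of observations altered by Winsorization."""
    lo = np.percentile(values, lower_pct)
    hi = np.percentile(values, upper_pct)
    return ((values < lo) | (values > hi)).mean()

# Heavy-tailed synthetic sample, qualitatively similar to stock returns
rng = np.random.default_rng(42)
sample = rng.standard_t(df=2, size=1000)
print(f"Fraction of values clipped: {clipped_fraction(sample):.2%}")
```

By construction, roughly 10% of the observations are modified at these limits; whether that is acceptable depends on how informative the tails are for the analysis at hand.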
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# Perform hierarchical clustering with single linkage
Z_single = linkage(df_scaled, method='single')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z_single)
plt.title('Dendrogram (Single Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Check cophenetic correlation
c_single, _ = cophenet(Z_single, pdist(df_scaled))
print(f'Cophenetic Correlation (Single Linkage): {c_single:.2f}')
# Determine the appropriate number of clusters
# Example: Using 4 clusters, adjust as needed
clusters_single = fcluster(Z_single, 4, criterion='maxclust')
# Add cluster labels to the original dataframe
df['Cluster_Single'] = clusters_single
# Cluster profiling
# Only numeric columns are used for profiling
cluster_profile_single = df.groupby('Cluster_Single')[numeric_columns].mean()
print(f'Cluster Profiling (Single Linkage):\n', cluster_profile_single)
Cophenetic Correlation (Single Linkage): 0.92
Cluster Profiling (Single Linkage):
Current Price Price Change Volatility ROE \
Cluster_Single
1 76.846312 4.052844 1.525345 37.053412
2 276.570007 6.189286 1.116976 30.000000
3 44.470001 11.397804 2.405408 917.000000
4 1274.949951 3.190527 1.268340 29.000000
Cash Ratio Net Cash Flow Net Income Earnings Per Share \
Cluster_Single
1 69.789318 5.865072e+07 1.568163e+09 2.807908
2 25.000000 9.088500e+07 5.965410e+08 8.910000
3 80.000000 6.980000e+08 -2.352800e+10 -61.200000
4 184.000000 -1.671386e+09 2.551360e+09 50.090000
Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster_Single
1 5.806745e+08 32.459017 -2.128154
2 6.695185e+07 31.040405 129.064585
3 3.844444e+08 93.089287 4.970809
4 5.093552e+07 25.453183 -1.052429
2. Complete Linkage
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# Perform hierarchical clustering with complete linkage
Z_complete = linkage(df_scaled, method='complete')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z_complete)
plt.title('Dendrogram (Complete Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Check cophenetic correlation
c_complete, _ = cophenet(Z_complete, pdist(df_scaled))
print(f'Cophenetic Correlation (Complete Linkage): {c_complete:.2f}')
# Determine the appropriate number of clusters
# Example: Using 4 clusters, adjust as needed
clusters_complete = fcluster(Z_complete, 4, criterion='maxclust')
# Add cluster labels to the original dataframe
df['Cluster_Complete'] = clusters_complete
# Cluster profiling
cluster_profile_complete = df.groupby('Cluster_Complete')[numeric_columns].mean()
print(f'Cluster Profiling (Complete Linkage):\n', cluster_profile_complete)
Cophenetic Correlation (Complete Linkage): 0.79
Cluster Profiling (Complete Linkage):
Current Price Price Change Volatility ROE \
Cluster_Complete
1 26.990000 -14.060688 3.296307 603.000000
2 109.283336 9.555034 1.253895 18.333333
3 77.505788 4.194932 1.513252 34.744745
4 1274.949951 3.190527 1.268340 29.000000
Cash Ratio Net Cash Flow Net Income Earnings Per Share \
Cluster_Complete
1 57.333333 -5.850000e+08 -1.755567e+10 -39.726667
2 95.333333 1.120063e+10 9.301514e+09 5.166667
3 69.567568 -3.391204e+07 1.592498e+09 2.995961
4 184.000000 -1.671386e+09 2.551360e+09 50.090000
Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster_Complete
1 4.819101e+08 71.528835 1.638633
2 1.883537e+09 19.446599 44.027184
3 5.676948e+08 32.402078 -2.162612
4 5.093552e+07 25.453183 -1.052429
3. Average Linkage
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# Perform hierarchical clustering with average linkage
Z_average = linkage(df_scaled, method='average')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z_average)
plt.title('Dendrogram (Average Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Check cophenetic correlation
c_average, _ = cophenet(Z_average, pdist(df_scaled))
print(f'Cophenetic Correlation (Average Linkage): {c_average:.2f}')
# Determine the appropriate number of clusters
# Example: Using 4 clusters, adjust as needed
clusters_average = fcluster(Z_average, 4, criterion='maxclust')
# Add cluster labels to the original dataframe
df['Cluster_Average'] = clusters_average
# Cluster profiling
cluster_profile_average = df.groupby('Cluster_Average')[numeric_columns].mean()
print(f'Cluster Profiling (Average Linkage):\n', cluster_profile_average)
Cophenetic Correlation (Average Linkage): 0.94
Cluster Profiling (Average Linkage):
Current Price Price Change Volatility ROE \
Cluster_Average
1 24.485001 -13.351992 3.482611 802.000000
2 77.573266 4.148438 1.515708 35.184524
3 104.660004 16.224320 1.320606 8.000000
4 1274.949951 3.190527 1.268340 29.000000
Cash Ratio Net Cash Flow Net Income Earnings Per Share \
Cluster_Average
1 51.000000 -1.292500e+09 -1.910650e+10 -41.81500
2 67.154762 6.710469e+07 1.607391e+09 2.90564
3 958.000000 5.920000e+08 3.669000e+09 1.31000
4 184.000000 -1.671386e+09 2.551360e+09 50.09000
Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster_Average
1 5.195740e+08 60.748608 1.565141
2 5.723178e+08 32.325679 -1.762402
3 2.800763e+09 79.893133 5.884467
4 5.093552e+07 25.453183 -1.052429
4. Ward Linkage
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# Perform hierarchical clustering with ward linkage
Z_ward = linkage(df_scaled, method='ward')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z_ward)
plt.title('Dendrogram (Ward Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Check cophenetic correlation
c_ward, _ = cophenet(Z_ward, pdist(df_scaled))
print(f'Cophenetic Correlation (Ward Linkage): {c_ward:.2f}')
# Determine the appropriate number of clusters
# Example: Using 4 clusters, adjust as needed
clusters_ward = fcluster(Z_ward, 4, criterion='maxclust')
# Add cluster labels to the original dataframe
df['Cluster_Ward'] = clusters_ward
# Cluster profiling
cluster_profile_ward = df.groupby('Cluster_Ward')[numeric_columns].mean()
print(f'Cluster Profiling (Ward Linkage):\n', cluster_profile_ward)
Cophenetic Correlation (Ward Linkage): 0.71
Cluster Profiling (Ward Linkage):
Current Price Price Change Volatility ROE Cash Ratio \
Cluster_Ward
1 48.006208 -11.263107 2.590247 196.551724 40.275862
2 42.848182 6.270446 1.123547 22.727273 71.454545
3 72.760400 5.213307 1.427078 25.603509 60.392982
4 326.198218 10.563242 1.642560 14.400000 309.466667
Net Cash Flow Net Income Earnings Per Share \
Cluster_Ward
1 -4.959017e+08 -3.597245e+09 -8.689655
2 5.586364e+08 1.463127e+10 3.410000
3 7.995151e+07 1.538594e+09 3.655351
4 2.888507e+08 8.644985e+08 7.785333
Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster_Ward
1 4.863198e+08 75.110924 -2.162622
2 4.242573e+09 15.242169 -4.924615
3 4.464721e+08 24.722670 -2.647194
4 5.449003e+08 113.095334 19.142151
5. Centroid Linkage
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# Perform hierarchical clustering with centroid linkage
Z_centroid = linkage(df_scaled, method='centroid')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z_centroid)
plt.title('Dendrogram (Centroid Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Check cophenetic correlation
c_centroid, _ = cophenet(Z_centroid, pdist(df_scaled))
print(f'Cophenetic Correlation (Centroid Linkage): {c_centroid:.2f}')
# Determine the appropriate number of clusters
# Example: Using 4 clusters, adjust as needed
clusters_centroid = fcluster(Z_centroid, 4, criterion='maxclust')
# Add cluster labels to the original dataframe
df['Cluster_Centroid'] = clusters_centroid
# Cluster profiling
cluster_profile_centroid = df.groupby('Cluster_Centroid')[numeric_columns].mean()
print(f'Cluster Profiling (Centroid Linkage):\n', cluster_profile_centroid)
Cophenetic Correlation (Centroid Linkage): 0.93
Cluster Profiling (Centroid Linkage):
Current Price Price Change Volatility ROE \
Cluster_Centroid
1 26.990000 -14.060688 3.296307 603.000000
2 77.709305 4.207024 1.511505 34.677612
3 104.660004 16.224320 1.320606 8.000000
4 1274.949951 3.190527 1.268340 29.000000
Cash Ratio Net Cash Flow Net Income Earnings Per Share \
Cluster_Centroid
1 57.333333 -5.850000e+08 -1.755567e+10 -39.726667
2 67.146269 6.482739e+07 1.655336e+09 3.020433
3 958.000000 5.920000e+08 3.669000e+09 1.310000
4 184.000000 -1.671386e+09 2.551360e+09 50.090000
Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster_Centroid
1 4.819101e+08 71.528835 1.638633
2 5.728126e+08 32.144295 -1.772993
3 2.800763e+09 79.893133 5.884467
4 5.093552e+07 25.453183 -1.052429
6. Median Linkage
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# Perform hierarchical clustering with median linkage
Z_median = linkage(df_scaled, method='median')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z_median)
plt.title('Dendrogram (Median Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Check cophenetic correlation
c_median, _ = cophenet(Z_median, pdist(df_scaled))
print(f'Cophenetic Correlation (Median Linkage): {c_median:.2f}')
# Determine the appropriate number of clusters
# Example: Using 4 clusters, adjust as needed
clusters_median = fcluster(Z_median, 4, criterion='maxclust')
# Add cluster labels to the original dataframe
df['Cluster_Median'] = clusters_median
# Cluster profiling
cluster_profile_median = df.groupby('Cluster_Median')[numeric_columns].mean()
print(f'Cluster Profiling (Median Linkage):\n', cluster_profile_median)
Cophenetic Correlation (Median Linkage): 0.92
Cluster Profiling (Median Linkage):
Current Price Price Change Volatility ROE \
Cluster_Median
1 24.485001 -13.351992 3.482611 802.000000
2 77.653642 4.184271 1.515129 35.103858
3 1274.949951 3.190527 1.268340 29.000000
Cash Ratio Net Cash Flow Net Income Earnings Per Share \
Cluster_Median
1 51.00000 -1.292500e+09 -1.910650e+10 -41.815000
2 69.79822 6.866225e+07 1.613509e+09 2.900905
3 184.00000 -1.671386e+09 2.551360e+09 50.090000
Estimated Shares Outstanding P/E Ratio P/B Ratio
Cluster_Median
1 5.195740e+08 60.748608 1.565141
2 5.789304e+08 32.466828 -1.739711
3 5.093552e+07 25.453183 -1.052429
Question 1: Which clustering technique took less time for execution?
Comparison of execution time for K-means and hierarchical clustering methods
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import time
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# K-means clustering
start_time_kmeans = time.time()
optimal_k = 3 # Based on elbow and silhouette results
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df_scaled['Cluster_KMeans'] = kmeans.fit_predict(df_scaled)
end_time_kmeans = time.time()
execution_time_kmeans = end_time_kmeans - start_time_kmeans
# Hierarchical clustering with various linkage methods
linkage_methods = ['single', 'complete', 'average', 'ward', 'centroid', 'median']
execution_times = {}
for method in linkage_methods:
    start_time = time.time()
    # Use only the numeric feature columns (df_scaled also holds the K-means labels)
    Z = linkage(df_scaled[numeric_columns], method=method)
    end_time = time.time()
    execution_times[method] = end_time - start_time
    # Assign clusters and profile
    clusters = fcluster(Z, 4, criterion='maxclust')
    df[f'Cluster_{method.capitalize()}'] = clusters
    cluster_profile = df.groupby(f'Cluster_{method.capitalize()}')[numeric_columns].mean()
    # print(f'Cluster Profiling ({method.capitalize()} Linkage):\n', cluster_profile)
# Print execution times
print(f"Execution time for K-means clustering: {execution_time_kmeans:.4f} seconds")
for method, exec_time in execution_times.items():
    print(f"Execution time for hierarchical clustering ({method.capitalize()} Linkage): {exec_time:.4f} seconds")
# Create a DataFrame to summarize execution times
execution_times_summary = pd.DataFrame({
'Method': ['K-means'] + linkage_methods,
'Execution Time (seconds)': [execution_time_kmeans] + [execution_times[method] for method in linkage_methods]
})
print("\nExecution Times Summary:")
print(execution_times_summary)
Execution time for K-means clustering: 0.0165 seconds
Execution time for hierarchical clustering (Single Linkage): 0.0016 seconds
Execution time for hierarchical clustering (Complete Linkage): 0.0025 seconds
Execution time for hierarchical clustering (Average Linkage): 0.0025 seconds
Execution time for hierarchical clustering (Ward Linkage): 0.0026 seconds
Execution time for hierarchical clustering (Centroid Linkage): 0.0028 seconds
Execution time for hierarchical clustering (Median Linkage): 0.0036 seconds
Execution Times Summary:
Method Execution Time (seconds)
0 K-means 0.016465
1 single 0.001605
2 complete 0.002483
3 average 0.002513
4 ward 0.002565
5 centroid 0.002788
6 median 0.003632
Hierarchical clustering methods, including Single, Complete, Average, Ward, Centroid, and Median Linkage, demonstrate significantly faster execution times than K-means clustering on this dataset: the hierarchical runs range from roughly 0.0016 to 0.0036 seconds, whereas K-means takes approximately 0.0165 seconds. Note that this holds for a dataset of this size; hierarchical clustering scales quadratically with the number of observations, so the comparison can flip for much larger datasets.
K-means Clustering: Although K-means is computationally efficient, it does not have an associated cophenetic correlation coefficient. Thus, its cluster distinctiveness cannot be directly compared to hierarchical methods using this metric.
Hierarchical Clustering Methods:
In summary, on this dataset the hierarchical methods run faster than K-means. Among them, Average Linkage and Centroid Linkage preserve pairwise distances best, as indicated by their higher cophenetic correlation coefficients. Single Linkage is the fastest and also scores well on this metric, though it is prone to chaining; Median Linkage is close behind the best methods, while Complete Linkage and Ward Linkage show noticeably lower cophenetic correlations.
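A caveat on the timings above: single `time.time()` measurements are noisy at the millisecond scale. A more robust sketch repeats each fit and keeps the best time, using `time.perf_counter` and synthetic data as a stand-in for the scaled features:

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(340, 11))  # synthetic stand-in for the scaled features

def best_of(fn, repeats=5):
    """Run fn several times and keep the fastest wall-clock time."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

t_kmeans = best_of(lambda: KMeans(n_clusters=3, n_init=10, random_state=42).fit(X))
t_ward = best_of(lambda: linkage(X, method='ward'))
print(f"K-means: {t_kmeans:.4f} s, Ward linkage: {t_ward:.4f} s")
```

Taking the minimum over several repeats filters out interpreter warm-up and background noise, which a single measurement cannot.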
Question 2: Which clustering technique gave you more distinct clusters, or are they the same?
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
import time
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('/content/stock_data.csv')
# Identify numeric columns for scaling
numeric_columns = [
'Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'
]
# Filter only numeric columns
df_numeric = df[numeric_columns]
# Scale the numeric features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_numeric), columns=numeric_columns)
# Initialize a dictionary to store results
results = {'Method': [], 'Execution Time (seconds)': [], 'Cophenetic Correlation': []}
# K-means clustering
optimal_k = 3 # Based on elbow and silhouette results
start_time_kmeans = time.time()
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df_scaled['Cluster_KMeans'] = kmeans.fit_predict(df_scaled)
end_time_kmeans = time.time()
execution_time_kmeans = end_time_kmeans - start_time_kmeans
results['Method'].append('K-means')
results['Execution Time (seconds)'].append(execution_time_kmeans)
results['Cophenetic Correlation'].append('N/A') # Cophenetic correlation is not applicable for K-means
# Hierarchical clustering methods and their execution times and cophenetic correlations
linkage_methods = ['single', 'complete', 'average', 'ward', 'centroid', 'median']
for method in linkage_methods:
    start_time = time.time()
    # Use only the numeric feature columns (df_scaled also holds the K-means labels)
    Z = linkage(df_scaled[numeric_columns], method=method)
    end_time = time.time()
    execution_time = end_time - start_time
    cophenetic_corr, _ = cophenet(Z, pdist(df_scaled[numeric_columns]))
    results['Method'].append(f'Hierarchical - {method.capitalize()} Linkage')
    results['Execution Time (seconds)'].append(execution_time)
    results['Cophenetic Correlation'].append(cophenetic_corr)
    # Optional: Plot dendrogram
    plt.figure(figsize=(10, 7))
    dendrogram(Z)
    plt.title(f'Dendrogram ({method.capitalize()} Linkage)')
    plt.xlabel('Sample index')
    plt.ylabel('Distance')
    plt.show()
# Convert results to a DataFrame for better visualization
results_df = pd.DataFrame(results)
print(results_df)
                             Method  Execution Time (seconds) Cophenetic Correlation
0                           K-means                  0.016045                    N/A
1     Hierarchical - Single Linkage                  0.002041               0.925822
2   Hierarchical - Complete Linkage                  0.004452               0.869896
3    Hierarchical - Average Linkage                  0.004860               0.942952
4       Hierarchical - Ward Linkage                  0.007531               0.709546
5   Hierarchical - Centroid Linkage                  0.006528               0.944282
6     Hierarchical - Median Linkage                  0.004285               0.894304
Observation: On this dataset, hierarchical clustering runs faster than K-means, largely because K-means repeats its iterative fitting across multiple random initializations (n_init), whereas a single linkage computation over a few hundred observations is inexpensive.
The Cophenetic Correlation Coefficient evaluates how well the clustering method preserves the pairwise distances between data points. Higher values indicate better cluster distinctiveness.
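Concretely, the coefficient is the Pearson correlation between the original pairwise distances and the cophenetic distances (the merge heights in the dendrogram). A toy sketch with two well-separated blobs, where the correlation should come out close to 1:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Two tight, well-separated blobs: the hierarchy should preserve distances well
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

original_dists = pdist(X)
c, cophenetic_dists = cophenet(linkage(X, method='average'), original_dists)
print(f"Cophenetic correlation: {c:.3f}")
```

With overlapping or chained clusters the merge heights distort the original distances and the coefficient drops, which is what distinguishes the linkage methods compared above.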
Observation: Among hierarchical methods, Centroid Linkage (0.94) and Average Linkage (0.94) exhibit the highest cophenetic correlation coefficients, indicating they preserve the original pairwise distances most faithfully and produce distinct clusters. Single Linkage (0.93) performs nearly as well, followed by Median Linkage (0.89) and Complete Linkage (0.87). Ward Linkage (0.71) shows the lowest cophenetic correlation, suggesting the least faithful preservation of the original distance structure.
K-means Clustering: While computationally efficient, K-means may produce less distinct clusters, particularly when data does not fit the algorithm’s assumptions of spherical shape and uniform density.
Hierarchical Clustering Methods:
In summary, hierarchical clustering methods, especially Average Linkage and Centroid Linkage, generally offer more distinct clusters compared to K-means Clustering, as indicated by their higher cophenetic correlation coefficients. These hierarchical methods better preserve the original distance relationships among data points, leading to more accurate clustering. On the other hand, K-means is computationally efficient but may struggle with clusters that do not fit its assumptions. Among hierarchical methods, Single Linkage is the fastest but has lower distinctiveness, while Average Linkage provides the best cluster quality in terms of distinctiveness.
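Because cophenetic correlation does not apply to K-means, a metric defined for any labeling, such as the silhouette score, allows a direct head-to-head comparison of the two algorithms. A minimal sketch on synthetic, well-separated data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# Three well-separated blobs
X = np.vstack([rng.normal(loc, 0.5, (30, 2)) for loc in (0, 4, 8)])

# Labels from both algorithms on the same data
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hc_labels = fcluster(linkage(X, method='ward'), 3, criterion='maxclust')

print(f"Silhouette (K-means): {silhouette_score(X, km_labels):.3f}")
print(f"Silhouette (Ward):    {silhouette_score(X, hc_labels):.3f}")
```

On data like this both methods score highly; on the real stock data the scores would quantify which partition is more cohesive and better separated.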
Question 3: How many observations are there in the similar clusters of both algorithms?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Sample data (replace this with your actual dataset)
data = pd.read_csv('stock_data.csv')
# Define K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42).fit(data[['Current Price', 'Price Change']])
data['KMeans_Cluster'] = kmeans.labels_
# Function to perform hierarchical clustering using scipy
def hierarchical_clustering(data, method):
    Z = linkage(data[['Current Price', 'Price Change']], method=method)
    data[f'Hierarchical_{method}_Cluster'] = fcluster(Z, t=3, criterion='maxclust')
    return data
# Apply hierarchical clustering with supported methods
methods = ['ward', 'single', 'complete', 'average']
for method in methods:
    data = hierarchical_clustering(data, method)
# Apply hierarchical clustering with centroid and median using scipy
def centroid_linkage(data):
    Z = linkage(data[['Current Price', 'Price Change']], method='centroid')
    data['Hierarchical_centroid_Cluster'] = fcluster(Z, t=3, criterion='maxclust')
    return data

def median_linkage(data):
    Z = linkage(data[['Current Price', 'Price Change']], method='median')
    data['Hierarchical_median_Cluster'] = fcluster(Z, t=3, criterion='maxclust')
    return data
data = centroid_linkage(data)
data = median_linkage(data)
# Function to calculate the number of similar clusters
def count_similar_clusters(df, method):
    return df.groupby(['KMeans_Cluster', method]).size().reset_index(name='Count')
# Calculate similar clusters for each hierarchical method
linkage_methods = methods + ['centroid', 'median']
similar_clusters_summary = {}
for method in linkage_methods:
    similar_clusters_summary[method] = count_similar_clusters(data, f'Hierarchical_{method}_Cluster')
# Display the results
for method, summary in similar_clusters_summary.items():
    print(f"\nSimilar clusters for Hierarchical ({method}) vs K-means:")
    print(summary)
Similar clusters for Hierarchical (ward) vs K-means:
   KMeans_Cluster  Hierarchical_ward_Cluster  Count
0               0                          2     45
1               0                          3    283
2               1                          1      4
3               1                          2      7
4               2                          1      1

Similar clusters for Hierarchical (single) vs K-means:
   KMeans_Cluster  Hierarchical_single_Cluster  Count
0               0                            1    328
1               1                            1      7
2               1                            2      4
3               2                            3      1

Similar clusters for Hierarchical (complete) vs K-means:
   KMeans_Cluster  Hierarchical_complete_Cluster  Count
0               0                              2    328
1               1                              1      4
2               1                              2      7
3               2                              3      1

Similar clusters for Hierarchical (average) vs K-means:
   KMeans_Cluster  Hierarchical_average_Cluster  Count
0               0                             2    328
1               1                             1      4
2               1                             2      7
3               2                             3      1

Similar clusters for Hierarchical (centroid) vs K-means:
   KMeans_Cluster  Hierarchical_centroid_Cluster  Count
0               0                              2    328
1               1                              1      4
2               1                              2      7
3               2                              3      1

Similar clusters for Hierarchical (median) vs K-means:
   KMeans_Cluster  Hierarchical_median_Cluster  Count
0               0                            2    328
1               1                            1      4
2               1                            2      7
3               2                            3      1
Observations
Hierarchical (Ward) vs K-means:
Cluster 0 in K-means overlaps most significantly with Clusters 2 and 3 from the Ward linkage method (45 and 283 observations, respectively).
Cluster 1 in K-means overlaps with Clusters 1 and 2 from Ward linkage (4 and 7 observations, respectively).
Cluster 2 in K-means matches Cluster 1 from Ward linkage, with only 1 observation.

Hierarchical (Single Linkage) vs K-means:
Cluster 0 in K-means primarily overlaps with Cluster 1 from Single Linkage, containing 328 observations.
Cluster 1 in K-means overlaps with Clusters 1 and 2 from Single Linkage (7 and 4 observations, respectively).
Cluster 2 in K-means corresponds to Cluster 3 from Single Linkage, with just 1 observation.

Hierarchical (Complete Linkage) vs K-means:
Cluster 0 in K-means aligns almost entirely with Cluster 2 from Complete Linkage (328 observations).
Cluster 1 in K-means overlaps with Clusters 1 and 2 from Complete Linkage (4 and 7 observations, respectively).
Cluster 2 in K-means matches Cluster 3 from Complete Linkage, with only 1 observation.

Hierarchical (Average Linkage) vs K-means:
Cluster 0 in K-means corresponds to Cluster 2 from Average Linkage (328 observations).
Cluster 1 in K-means aligns with Clusters 1 and 2 from Average Linkage (4 and 7 observations, respectively).
Cluster 2 in K-means corresponds to Cluster 3 from Average Linkage, with just 1 observation.

Hierarchical (Centroid) vs K-means:
Cluster 0 in K-means overlaps with Cluster 2 from Centroid Linkage (328 observations).
Cluster 1 in K-means matches Clusters 1 and 2 from Centroid Linkage (4 and 7 observations, respectively).
Cluster 2 in K-means corresponds to Cluster 3 from Centroid Linkage, with only 1 observation.

Hierarchical (Median) vs K-means:
Cluster 0 in K-means corresponds to Cluster 2 from Median Linkage (328 observations).
Cluster 1 in K-means overlaps with Clusters 1 and 2 from Median Linkage (4 and 7 observations, respectively).
Cluster 2 in K-means matches Cluster 3 from Median Linkage, with only 1 observation.

In the Complete, Average, Centroid, and Median methods, Cluster 0 in K-means consistently aligns with hierarchical Cluster 2 (328 observations); Ward splits it across Clusters 2 and 3, and Single Linkage places it in Cluster 1. Cluster 1 in K-means overlaps with hierarchical Clusters 1 and 2 throughout, and Cluster 2 in K-means maps to a hierarchical cluster containing only a single observation.
These comparisons indicate that the hierarchical methods cluster consistently with K-means for the one large cluster, while the assignments of the handful of observations in the smaller clusters vary from method to method.
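Beyond eyeballing the cross-tabulations, the agreement between two label sets can be condensed into a single number. Below is a minimal sketch on synthetic data (not the Trade&Ahead dataset) using scikit-learn's Adjusted Rand Index, which is invariant to how the cluster IDs are numbered, so a K-means "Cluster 0" matching a hierarchical "Cluster 2" is counted as agreement.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic three-blob data as a stand-in for the scaled stock features
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Partition the same points with both algorithms
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hc_labels = fcluster(linkage(X, method='ward'), t=3, criterion='maxclust')

# ARI is 1.0 for identical partitions and ~0.0 for chance-level agreement
ari = adjusted_rand_score(km_labels, hc_labels)
print(f"Adjusted Rand Index (K-means vs Ward): {ari:.3f}")
```

Running the same comparison for each linkage method would give one agreement score per method, complementing the count tables above.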
Question 4: How many clusters are obtained as the appropriate number of clusters from both algorithms?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Load the dataset
data = pd.read_csv('stock_data.csv')
# Select numerical features (adjust based on your dataset)
features = data[['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio']]
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)
# Range of cluster numbers
range_n_clusters = list(range(2, 11))
silhouette_avg = []
inertia = []
# Calculate silhouette scores and inertia for different numbers of clusters
for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X_scaled)
    cluster_labels = kmeans.labels_
    silhouette_avg.append(silhouette_score(X_scaled, cluster_labels))
    inertia.append(kmeans.inertia_)
# Plot the silhouette scores and inertia
plt.figure(figsize=(12, 8))
# Plot Silhouette Score
plt.plot(range_n_clusters, silhouette_avg, marker='o', color='blue', label='Silhouette Score')
plt.ylabel('Silhouette Score')
plt.xlabel('Number of Clusters')
plt.xticks(range_n_clusters)
plt.grid(True)
# Create a second y-axis for the Elbow Method
ax2 = plt.gca().twinx()
ax2.plot(range_n_clusters, inertia, marker='x', color='green', linestyle='--', label='Inertia')
ax2.set_ylabel('Inertia')
# Annotate the optimal number of clusters for silhouette score
optimal_n_clusters_silhouette = 3 # Annotated as the optimal number of clusters
plt.axvline(x=optimal_n_clusters_silhouette, color='blue', linestyle='--', label=f'Silhouette Optimal ({optimal_n_clusters_silhouette})')
plt.annotate(f'{optimal_n_clusters_silhouette}', xy=(optimal_n_clusters_silhouette, silhouette_avg[optimal_n_clusters_silhouette - 2]),
xytext=(optimal_n_clusters_silhouette + 0.5, silhouette_avg[optimal_n_clusters_silhouette - 2] + 0.05),
arrowprops=dict(facecolor='blue', shrink=0.05))
# Annotate the optimal number of clusters for the elbow method
plt.axvline(x=optimal_n_clusters_silhouette, color='green', linestyle='--', label=f'Elbow Optimal ({optimal_n_clusters_silhouette})')
ax2.annotate(f'{optimal_n_clusters_silhouette}', xy=(optimal_n_clusters_silhouette, inertia[optimal_n_clusters_silhouette - 2]),
xytext=(optimal_n_clusters_silhouette + 0.5, inertia[optimal_n_clusters_silhouette - 2] + 1000),
arrowprops=dict(facecolor='green', shrink=0.05))
# Add legends
plt.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.title('K-means Clustering: Silhouette Score and Elbow Method')
plt.show()
import scipy.cluster.hierarchy as sch
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Hierarchical Clustering
linkages = ['ward', 'single', 'complete', 'average']
plt.figure(figsize=(15, 10))
for i, linkage in enumerate(linkages, 1):
    plt.subplot(2, 2, i)
    Z = sch.linkage(X_scaled, method=linkage)
    sch.dendrogram(Z, no_labels=True)
    plt.title(f'Dendrogram - {linkage} Linkage')
plt.tight_layout()
plt.show()
# Compute Cophenetic Correlation Coefficient for different linkages
cophenetic_corr = {}
for method in linkages:
    Z = sch.linkage(X_scaled, method=method)
    # cophenet returns the coefficient first, then the cophenetic distances
    coph_corr, _ = sch.cophenet(Z, sch.distance.pdist(X_scaled))
    cophenetic_corr[method] = coph_corr
# Print Cophenetic Correlation Coefficients
print("Cophenetic Correlation Coefficients:")
for linkage, corr in cophenetic_corr.items():
    print(f"{linkage}: {corr:.4f}")
Cophenetic Correlation Coefficients:
ward: 0.7101
single: 0.9232
complete: 0.7873
average: 0.9423
Observation on Optimal Number of Clusters
K-means Clustering:
Both the Elbow Method and the Silhouette Score point to 3 as the appropriate number of clusters, as annotated in the plot above.
Hierarchical Clustering:
The Cophenetic Correlation Coefficient (CCC) measures how well the dendrogram (hierarchical tree) preserves the pairwise distances between the original observations. A higher CCC value indicates a better fit.
Analysis:
Average Linkage attains the highest CCC (0.9423), followed by Single (0.9232), Complete (0.7873), and Ward (0.7101). Cutting the dendrograms into 3 clusters produces groupings that broadly agree with the 3-cluster K-means solution from Question 3.
Conclusion:
Thus, based on both algorithms and their respective methods, 3 clusters is the appropriate number of clusters to use for clustering analysis.
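To illustrate how the silhouette criterion singles out k = 3, here is a minimal sketch on synthetic three-blob data (a hypothetical stand-in for the scaled stock features, not the actual dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Score each candidate k by the mean silhouette coefficient
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the suggested cluster count
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

On data with three genuinely separated groups, the silhouette score peaks at k = 3, mirroring the behaviour seen in the plot for the stock data.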
Question 5: You can also mention any differences or similarities you obtained in the cluster profiles from both the clustering techniques.
Differences:
Cluster Overlap: The two techniques agree on the single large cluster (328 observations) but split the remaining handful of observations differently, as seen in the Question 3 cross-tabulations.
Execution Time: The hierarchical linkage methods ran faster than K-means on this dataset.
Cophenetic Correlation Coefficient: Average Linkage (0.9423) and Single Linkage (0.9232) preserve the original pairwise distances better than Complete (0.7873) or Ward (0.7101); K-means has no directly comparable measure.

Similarities:
Optimal Number of Clusters: Both techniques point to 3 clusters as the appropriate choice.
Cluster Consistency: The largest cluster is recovered almost identically by both techniques; the variability is confined to the small clusters.
Both K-means and hierarchical clustering techniques agree on the optimal number of clusters (3) and show consistent patterns in larger cluster profiles. However, hierarchical methods demonstrate faster execution times and more distinct clusters, as reflected by the CCC values. While K-means is efficient in certain contexts, hierarchical clustering methods (especially Average Linkage and Centroid Linkage) provide a more accurate representation of the underlying data structure, particularly for distinct cluster preservation.
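The execution-time comparison can be reproduced in miniature. This sketch (on random synthetic data; absolute times depend on hardware) times one K-means fit against one scipy linkage pass. Note that linkage scales quadratically in the number of observations, so the ranking observed on a few hundred stocks can flip on much larger datasets.

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

# Random stand-in data: 1000 observations, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))

# Time a K-means fit (n_init restarts included)
t0 = time.perf_counter()
KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
t_kmeans = time.perf_counter() - t0

# Time a single average-linkage pass
t0 = time.perf_counter()
linkage(X, method='average')
t_hier = time.perf_counter() - t0

print(f"K-means: {t_kmeans:.3f}s, average linkage: {t_hier:.3f}s")
```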
A. Actionable insights and recommendations for applying the two clustering methods
Optimal Number of Clusters: Both the Elbow Method and Silhouette Score indicate that three clusters are optimal for K-means clustering. This suggests that the most effective clustering solution for this dataset, supported by multiple metrics, involves dividing the data into three distinct groups. Therefore, when applying K-means or hierarchical clustering, use three clusters to achieve the most coherent results.
Hierarchical Clustering Methods: The Average Linkage method demonstrates the highest Cophenetic Correlation Coefficient (CCC) of 0.9423, indicating it best preserves the pairwise distances between objects. Therefore, Average Linkage should be prioritized for hierarchical clustering when aiming for the best fit and accurate clustering structure.
Execution Time Considerations: The hierarchical methods, particularly Single, Complete, and Average Linkage, ran faster than K-means clustering on this dataset. Bear in mind, however, that standard linkage algorithms scale quadratically with the number of observations, so this advantage may not carry over to much larger datasets.
Cluster Distinctiveness: Average Linkage and Centroid Linkage methods show the highest cophenetic correlation coefficients, implying that these methods maintain the distinctiveness of clusters more effectively than others. If cluster quality and distinctiveness are critical, prioritize these methods for clustering analysis.
Comparison of K-means and Hierarchical Clustering: While K-means is efficient and straightforward, it may not always provide the best results for cluster distinctiveness, especially for non-spherical clusters. Hierarchical methods generally offer better cluster quality but may require more computation time. For datasets where cluster shape and distribution are complex, hierarchical clustering methods may be preferable.
Handling Small Clusters: Both K-means and hierarchical methods exhibit variability in handling smaller clusters. For clusters with fewer observations, hierarchical methods may offer better consistency. Pay attention to how smaller clusters are formed and interpreted across different methods.
Practical Application: When deciding between K-means and hierarchical clustering, consider the trade-off between computational efficiency and clustering accuracy. For faster execution with decent results, hierarchical methods may be preferred. For very large datasets, a hybrid approach or pre-filtering may be needed to manage computational load.
Method Selection for Specific Goals: If preserving distance relationships between data points is crucial (e.g., in applications requiring high fidelity), Average Linkage and Centroid Linkage are recommended. For general clustering tasks with acceptable execution times, Single Linkage can be a viable option due to its speed.
Consistency Across Methods: Across all hierarchical methods compared (Ward, Single, Complete, Average, Centroid, and Median), K-means Cluster 0 aligns most closely with hierarchical Cluster 2, and K-means Cluster 1 shows overlap with hierarchical Clusters 1 and 2. This consistency suggests that K-means and hierarchical methods generally agree on the major cluster structures, but differences in smaller clusters should be carefully analyzed.
Future Analysis: For further refinement, consider evaluating additional clustering metrics or methods, such as the Davies-Bouldin Index or other distance measures, to enhance the robustness of clustering results. Additionally, performing a sensitivity analysis with different numbers of clusters or exploring dimensionality reduction techniques could provide more insights into the clustering structure.
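The Davies-Bouldin Index suggested above is available directly in scikit-learn (lower values indicate better-separated, more compact clusters). A minimal sketch on synthetic data rather than the actual stock features:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Synthetic three-blob data as a stand-in for the scaled stock features
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Davies-Bouldin for a few candidate cluster counts; lower is better
db_scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    db_scores[k] = davies_bouldin_score(X, labels)
    print(f"k={k}: Davies-Bouldin = {db_scores[k]:.3f}")
```

Comparing Davies-Bouldin values across k (and across clustering methods) would give a second, independent check on the 3-cluster choice derived from the silhouette and elbow analyses.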
B. Actionable insights and recommendations for Trade&Ahead to guide their investment strategy:
By utilizing clustering analysis, Trade&Ahead can help clients diversify their portfolios effectively, balance risk and reward, and align their investment strategies with their financial goals and risk appetite. The insights provided will enable the consultancy firm to recommend stock groups that are well-matched to various client profiles, ensuring smarter and more tailored investment decisions.